Pinecone

A vector database wrapper for Pinecone, integrated into the EazyML GenAI framework.

This class provides an interface to interact with the Pinecone vector database by extending the generic VectorDB class. It sets the vector DB type and initializes a Pinecone client using the provided API key.

class PineconeDB(**kwargs)

Bases: VectorDB

Initializes the PineconeDB instance.

Args:

kwargs: Connection parameters passed as keyword arguments, such as url and api_key. The Docker example below also passes host.

Example using an API key:

import os

from eazyml_genai.components import PineconeDB
# NOTE: GoogleEmbeddingModel must also be imported from the framework;
# its import path is not shown in this section.

# Initialize the Pinecone vector database.
pinecone_db = PineconeDB(api_key=os.getenv("PINECONE_API_KEY"))

# Index documents: pass a collection name and the documents to index.
# Choose a supported text embedding model from Hugging Face, Google, or OpenAI.
indexed_documents = pinecone_db.index_documents(collection_name="USER DEFINED COLLECTION NAME",
                        documents="JSON DOCUMENTS USING PDF LOADER",
                        text_embedding_model=GoogleEmbeddingModel.TEXT_EMBEDDING_004,
                        )

# Retrieve the documents most relevant to a question.
# Both arguments are passed by keyword: the signature lists collection_name first,
# so a positional question would collide with collection_name=.
total_hits = pinecone_db.retrieve_documents(collection_name="YOUR COLLECTION NAME", question="YOUR QUESTION", top_k=5)

Example using Docker Compose:

The docker-compose.yml file below starts a local Pinecone instance from the pinecone-local Docker image.

services:
    pinecone:
        image: ghcr.io/pinecone-io/pinecone-local:latest
        environment:
            PORT: 5080
            PINECONE_HOST: localhost
        ports:
            - "5080-5090:5080-5090"
        platform: linux/amd64

The following Python script connects to the Pinecone instance running in Docker:

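from eazyml_genai.components import PineconeDB
# NOTE: GoogleEmbeddingModel must also be imported from the framework;
# its import path is not shown in this section.

# Initialize the Pinecone vector database against the local Docker instance.
pinecone_db = PineconeDB(api_key='any_api_key', host="http://localhost:5080")

# Index documents: pass a collection name and the documents to index.
# Choose a supported text embedding model from Hugging Face, Google, or OpenAI.
indexed_documents = pinecone_db.index_documents(collection_name="USER DEFINED COLLECTION NAME",
                        documents="JSON DOCUMENTS USING PDF LOADER",
                        text_embedding_model=GoogleEmbeddingModel.TEXT_EMBEDDING_004,
                        )

# Retrieve the documents most relevant to a question.
# Both arguments are passed by keyword: the signature lists collection_name first,
# so a positional question would collide with collection_name=.
total_hits = pinecone_db.retrieve_documents(collection_name="YOUR COLLECTION NAME", question="YOUR QUESTION", top_k=5)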

index_documents(collection_name, documents, namespace='', **kwargs)

Indexes a list of documents into separate dense and sparse vector collections.

This method processes a list of dictionaries, where each dictionary represents a document and is expected to have at least a ‘content’ key and optionally a ‘title’ key. It generates both dense embeddings using a text embedding client and sparse TF-IDF vectors for each document. These vectors, along with the document’s metadata, are then stored in two separate collections (one for dense vectors and one for sparse vectors).
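To make the sparse half of this description concrete, the following is a minimal pure-Python sketch of TF-IDF weighting. It is not the library's implementation: the function name tfidf_sparse_vectors, the whitespace tokenizer, the smoothed IDF formula, and the indices/values output layout are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_sparse_vectors(texts):
    """Return one {'indices': [...], 'values': [...]} dict per input text.

    Hypothetical helper: tokenization and weighting are illustrative assumptions.
    """
    # Assumption: lowercase whitespace tokenization.
    tokenized = [text.lower().split() for text in texts]

    # Map every distinct term to a stable integer index.
    all_terms = sorted({term for doc in tokenized for term in doc})
    vocabulary = {term: i for i, term in enumerate(all_terms)}

    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    n_docs = len(texts)

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        indices, values = [], []
        for term in sorted(counts, key=vocabulary.__getitem__):
            tf = counts[term] / len(doc)
            # Smoothed inverse document frequency (one common variant).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            indices.append(vocabulary[term])
            values.append(tf * idf)
        vectors.append({"indices": indices, "values": values})
    return vectors


vectors = tfidf_sparse_vectors(["red apple", "red car"])
```

Terms that occur in fewer documents receive a larger weight, so in the example "apple" outweighs "red" in the first vector.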

Args:

collection_name (str): The base name for the dense and sparse vector collections that will be created or used. The actual collection names will be derived from this base name (e.g., ‘my_collection_dense’, ‘my_collection_sparse’).
documents (list[dict]): A list of dictionaries, where each dictionary represents a document. Each document should have a ‘content’ key containing the text to be indexed. Optionally, a ‘title’ key can also be present and will be included in the text used for generating embeddings and sparse vectors. Any other key-value pairs in the document dictionary will be stored as metadata.
namespace (str, optional): An optional namespace to apply when indexing the documents into the collections. Defaults to “”.
kwargs: Additional keyword arguments that will be passed to the create_collection method. This can include parameters like the dimension of the dense embeddings.
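As an illustration of the expected input, the sketch below builds a small documents list and checks it before indexing. The helpers validate_documents and derive_collection_names are hypothetical and not part of the library; the naming pattern simply follows the ‘my_collection_dense’ / ‘my_collection_sparse’ example given above.

```python
def validate_documents(documents):
    """Raise ValueError if any document lacks a non-empty string 'content'.

    Hypothetical pre-flight check; the library may validate differently.
    """
    for position, document in enumerate(documents):
        if not isinstance(document, dict):
            raise ValueError(f"document {position} is not a dict")
        content = document.get("content")
        if not isinstance(content, str) or not content:
            raise ValueError(f"document {position} has no 'content' text")


def derive_collection_names(base_name):
    """Return (dense_name, sparse_name) following the documented example pattern."""
    return f"{base_name}_dense", f"{base_name}_sparse"


documents = [
    # 'title' is optional; extra keys such as 'type' and 'path' become metadata.
    {"title": "Q3 report", "content": "Revenue grew year over year.",
     "type": "text", "path": "report.pdf"},
    {"content": "Untitled onboarding note."},
]

validate_documents(documents)
dense_name, sparse_name = derive_collection_names("my_collection")
```

Running a check like this before calling index_documents surfaces malformed input early, before any embeddings are computed.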

retrieve_documents(collection_name, question, top_k=5, document_types=['text', 'table', 'image'], namespace='', **kwargs)

Retrieves documents relevant to a given question from both the dense and sparse vector collections. This method performs a hybrid search, combining dense and sparse retrieval results into a more comprehensive set of relevant documents. Duplicate documents are removed, and each document’s metadata is transformed into the structure described under Returns.
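One way to picture the combination and de-duplication step is the sketch below. How the library actually merges and ranks the two result sets is not documented here, so the merge_hits helper and its keep-the-highest-score rule are assumptions for illustration only.

```python
def merge_hits(dense_hits, sparse_hits):
    """Combine two hit lists, keeping a single entry per document id.

    Hypothetical helper: when an id appears in both lists, the hit with the
    higher score is kept. The real ranking strategy may differ.
    """
    best_by_id = {}
    for hit in list(dense_hits) + list(sparse_hits):
        current = best_by_id.get(hit["id"])
        if current is None or hit["score"] > current["score"]:
            best_by_id[hit["id"]] = hit
    # Highest-scoring documents first.
    return sorted(best_by_id.values(), key=lambda hit: hit["score"], reverse=True)


dense_hits = [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.4}]
sparse_hits = [{"id": "b", "score": 0.7}, {"id": "c", "score": 0.1}]
merged = merge_hits(dense_hits, sparse_hits)
```

Dense and sparse scores are generally on different scales, so a production merge would likely normalize or re-rank them rather than compare raw values.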

Args:

question (str): The query question used to retrieve relevant documents.

collection_name (str, optional): The base name of the collections to query (both dense and sparse). If None, the default collection name (self.collection_name) is used. Defaults to None.

top_k (int, optional): The number of top-ranking documents to retrieve from each collection (dense and sparse). Defaults to 5.

document_types (list[str], optional): A list of document types to filter the retrieval results. Defaults to [‘text’, ‘table’, ‘image’].

namespace (str, optional): The namespace to use when querying the collections. Defaults to “”.

Returns:

list[dict]: A list of retrieved documents. Each document is a dictionary containing the following keys:

‘id’ (str): The unique identifier of the document.

‘score’ (float): The relevance score of the document to the query.

‘metadata’ (dict): A dictionary containing the document’s metadata, including ‘type’, ‘title’, ‘content’, ‘path’ (converted from string representation), and ‘meta’ (converted from string representation). Empty strings are used if values are None.
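The conversion "from string representation" mentioned for ‘path’ and ‘meta’ can be sketched as follows. This is an assumption about what such a conversion might look like, not the library's code: decode_field is a hypothetical helper that parses Python literals with ast.literal_eval and falls back to the raw string.

```python
import ast


def decode_field(value):
    """Convert a stored string back into a Python object where possible.

    Hypothetical helper: None becomes "", literal strings such as "['a', 'b']"
    are parsed, and anything unparsable is returned unchanged.
    """
    if value is None:
        return ""
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value
    return value


# A hit shaped like the documented return value, with stringified fields.
hit = {
    "id": "doc-1",
    "score": 0.82,
    "metadata": {"type": "text", "title": None, "content": "Some text",
                 "path": "['Chapter 1', 'Section 2']", "meta": "{'page': 3}"},
}
decoded = {key: decode_field(value) for key, value in hit["metadata"].items()}
```

ast.literal_eval only evaluates literals (lists, dicts, numbers, strings), which makes it a safer choice than eval for data read back from a store.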