Vector DB and GenAI Stack
A vector DB is an important part of the GenAI stack: it stores and searches word embeddings.
In simple terms, a "word embedding" encodes each word (or piece of text) from the training set as a vector — a numeric representation of the word. It encodes meaning in such a way that words that are closer in the vector space are expected to be similar in meaning. Embeddings are useful for syntactic parsing, sentiment analysis, and sequence prediction. The whole vocabulary, stored this way, forms a vector DB.
How does word embedding work? The OpenAI embedding model takes any string of text (up to a limit of roughly 8,000 tokens) and turns it into a vector with 1,536 dimensions. So each piece of text gets 1,536 floating point numbers as attributes. These numbers are derived from a sophisticated language model, which takes a vast amount of knowledge of human language and flattens it down to a list of floats. At 4 bytes per floating point number, that is 4*1,536 = 6,144 bytes per embedding — 6 KiB.
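The storage arithmetic above can be checked with a few lines of standard-library Python (the embedding values here are zero placeholders, not real model output):

```python
import struct

# A placeholder embedding: 1,536 float values, as ada-002 would return.
embedding = [0.0] * 1536

# Pack as 32-bit floats: 4 bytes each -> 4 * 1536 = 6,144 bytes = 6 KiB.
packed = struct.pack(f"{len(embedding)}f", *embedding)
print(len(packed))  # 6144
```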
Here is the curl command to invoke the API:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "Your text string goes here", "model": "text-embedding-ada-002"}'
https://platform.openai.com/docs/api-reference/embeddings/object
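The same call can be built in Python with just the standard library. A minimal sketch — the request is constructed but not sent here, since sending requires a valid OPENAI_API_KEY:

```python
import json
import os
import urllib.request

# Same payload as the curl command above.
payload = json.dumps({
    "input": "Your text string goes here",
    "model": "text-embedding-ada-002",
}).encode("utf-8")

req = urllib.request.Request(
    "https://api.openai.com/v1/embeddings",
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    },
)
# urllib.request.urlopen(req) would return JSON containing the
# 1,536-dimension vector under data[0]["embedding"].
```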
The word embeddings are stored in a vector index.
Use Cases
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
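The search use case can be sketched with toy 3-d vectors (real embeddings have 1,536 dimensions): rank stored documents by cosine similarity to the query vector. The vectors and labels below are made up for illustration:

```python
from math import sqrt

def cosine(a, b):
    # dot product divided by the product of the two magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy embeddings: animal-like documents point one way, vehicle-like another.
docs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # a cat/dog-like query vector

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['cat', 'dog', 'car'] — animal docs rank above 'car'
```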
Similarity
1. cosine similarity
It measures only direction — the angle between vectors — and ignores magnitude.
cosine = dot product / (product of magnitudes)
= (x1*y1 + x2*y2 + ... + x1536*y1536) / (sqrt(x1*x1 + x2*x2 + ... + x1536*x1536) * sqrt(y1*y1 + y2*y2 + ... + y1536*y1536))
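The formula translates directly into Python (shown on short vectors; the 1,536-term version works the same way):

```python
from math import sqrt

def cosine_similarity(x, y):
    # dot product / (magnitude of x * magnitude of y)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    mag_x = sqrt(sum(xi * xi for xi in x))
    mag_y = sqrt(sum(yi * yi for yi in y))
    return dot / (mag_x * mag_y)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 — orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 — same direction
```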
Facebook AI Research has an optimized implementation: https://github.com/facebookresearch/faiss
https://en.wikipedia.org/wiki/Cosine_similarity
2. Euclidean distance
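Unlike cosine similarity, Euclidean distance accounts for magnitude as well as direction. A quick sketch using the standard library (math.dist is available in Python 3.8+):

```python
from math import sqrt, dist

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# Euclidean distance: square root of the summed squared differences.
d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(d)           # 5.0
print(dist(a, b))  # 5.0 — stdlib equivalent
```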
Vector DB Examples:
- Chroma, an open-source embeddings store
- Elasticsearch, a popular search/analytics engine and vector database
- Milvus, a vector database built for scalable similarity search
- Pinecone, a fully managed vector database
- Qdrant, a vector search engine
- Redis as a vector database
- Typesense, fast open source vector search
- Weaviate, an open-source vector search engine
- Zilliz, data infrastructure, powered by Milvus
- Others: Activeloop, pgvector, Momento, Neo4j, Cassandra (CassIO library)
- Postgres https://innerjoin.bit.io/vector-similarity-search-in-postgres-with-bit-io-and-pgvector-c58ac34f408b
https://collabnix.com/getting-started-with-genai-stack-powered-with-docker-langchain-neo4j-and-ollama/
https://github.com/docker/genai-stack
https://www.docker.com/press-release/neo4j-langchain-ollama-launches-new-genai-stack-for-developers/
https://neo4j.com/developer-blog/genai-app-how-to-build/
https://www.youtube.com/watch?v=fWUzSMzSAU0
Containers
1. LangChain bot.py (Streamlit for UI), plus FastAPI and Svelte
It contains the application logic and data flows
2. ollama sentence_transformer
3. neo4j
4. ollama llama2
Ollama manages local LLMs.
It seems llm (from the Datasette project) is another tool similar to Ollama:
https://datasette.io/tools/llm
Another GenAI stack, where (1) GPT4 is used instead of ollama+llama2 and (2) ChromaDB is used instead of Neo4J https://medium.com/@rubentak/unleashing-the-power-of-intelligent-chatbots-with-gpt-4-and-vector-databases-a-step-by-step-8027e2ce9e78 and https://github.com/rubentak/Langchain/blob/main/notebooks/Langchain_doc_chroma.ipynb