Vector DB and GenAI Stack


A vector DB is an important part of the GenAI stack. It stores and searches word embeddings.

In simple terms, a "word embedding" encodes each word from the training set as a vector. It is a representation of the word that captures its meaning, in such a way that words that are closer in the vector space are expected to be similar in meaning. Word embeddings are useful for syntactic parsing, sentiment analysis, and sequence prediction. The whole vocabulary, stored as embeddings, forms a vector DB.

How does word embedding work? The OpenAI embedding model lets you take any string of text (up to a limit of roughly 8,000 tokens) and turn it into a 1,536-dimensional vector. So each piece of text is represented by 1,536 floating-point numbers as attributes. These floating-point numbers are derived from a sophisticated language model: it takes a vast amount of knowledge of human language and flattens it down to a list of floats. At 4 bytes per 32-bit float, that is 4 * 1,536 = 6,144 bytes per word embedding, i.e. 6 KiB.
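The storage arithmetic above can be checked directly. This small sketch computes the per-embedding size and confirms it by packing a (dummy) 1,536-dimensional vector as raw 32-bit floats:

```python
import struct

DIMENSIONS = 1536      # dimensions per embedding (text-embedding-ada-002)
BYTES_PER_FLOAT = 4    # a 32-bit float occupies 4 bytes

# 4 * 1536 = 6144 bytes = 6 KiB per embedding
print(DIMENSIONS * BYTES_PER_FLOAT)  # 6144

# Packing a dummy vector as raw 32-bit floats confirms the size
vector = [0.0] * DIMENSIONS
packed = struct.pack(f"{DIMENSIONS}f", *vector)
print(len(packed))  # 6144
```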

Here is the curl command to invoke the embeddings API:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "Your text string goes here",
       "model":"text-embedding-ada-002"}'
https://platform.openai.com/docs/api-reference/embeddings/object

The word embeddings are stored in a vector index.

Use Cases

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Similarity 

1. cosine similarity

It measures only direction (the angle between the two vectors), ignoring their magnitudes.

cosine = dot product / (product of magnitudes)

= (x1*y1 + x2*y2 + ... + x1536*y1536) / (square root of (x1*x1 + x2*x2 + ... + x1536*x1536) * square root of (y1*y1 + y2*y2 + ... + y1536*y1536))
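The formula above translates into a few lines of Python (a plain implementation for illustration; libraries like FAISS below do this much faster):

```python
import math

def cosine_similarity(x, y):
    # dot product of the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    # magnitudes (Euclidean norms) of each vector
    mag_x = math.sqrt(sum(a * a for a in x))
    mag_y = math.sqrt(sum(a * a for a in y))
    return dot / (mag_x * mag_y)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```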

Facebook AI Research has an optimized library for similarity search: https://github.com/facebookresearch/faiss

https://en.wikipedia.org/wiki/Cosine_similarity

2. Euclidean distance

It measures the straight-line distance between two vectors: the square root of the sum of squared differences of their components.
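A minimal Python sketch of Euclidean distance between two vectors:

```python
import math

def euclidean_distance(x, y):
    # straight-line distance: sqrt of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # classic 3-4-5 triangle -> 5.0
```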


Vector DB Example:  

  • Chroma, an open-source embeddings store
  • Elasticsearch, a popular search/analytics engine and vector database
  • Milvus, a vector database built for scalable similarity search
  • Pinecone, a fully managed vector database
  • Qdrant, a vector search engine
  • Redis as a vector database
  • Typesense, fast open source vector search
  • Weaviate, an open-source vector search engine
  • Zilliz, data infrastructure, powered by Milvus
  • Activeloop Deep Lake, pgvector, Momento, Neo4j, Cassandra (CassIO library)
  • Postgres https://innerjoin.bit.io/vector-similarity-search-in-postgres-with-bit-io-and-pgvector-c58ac34f408b
Here is a link that contains links to many relevant videos and articles:
https://simonwillison.net/2023/Oct/23/embeddings/

History

1990+ was about RDBMS, for transaction processing and reporting.
2005+ was about NoSQL: a simple model for large-scale store and retrieve.
2013+ is about graph DBs.

RDBMS vs. Graph DB
1. rows are nodes
2. joins are relationships
3. table names are labels
4. columns are properties

(c:node1)-[r:relationship1]->(p:node2)
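The mapping above can be illustrated with a small hypothetical Cypher snippet (the Customer/Product names are made-up examples, not from any specific dataset):

```cypher
// A customer row becomes a node: the label plays the role of the table name,
// and the columns become properties.
CREATE (c:Customer {name: 'Alice', city: 'Oslo'})
CREATE (p:Product {name: 'Widget'})
// A join-table row (customer_id, product_id) becomes a relationship.
CREATE (c)-[:ORDERED {quantity: 2}]->(p)
```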


Neo4j


Cypher is a query language for graph DBs.
LangChain has a Cypher QA chain (GraphCypherQAChain) to interact with a Neo4j DB:

https://github.com/tomasonjo/blogs/blob/master/llm/langchain_neo4j.ipynb
https://python.langchain.com/docs/use_cases/graph/graph_cypher_qa
https://towardsdatascience.com/integrating-neo4j-into-the-langchain-ecosystem-df0e988344d2

Neo4j ships with a "Northwind Graph" sample DB.

In the GenAI stack, a vector DB (or graph DB) is used for RAG:

* First, the application connects to the DB and obtains the schema.
* Then the application sends the question and schema to the LLM; the LLM returns a Cypher statement.
* Then the application executes the Cypher statement on the DB and gets the result.
* Then the application passes that result to the LLM along with instructions.
* This is how the LLM responds to the user.
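The steps above can be sketched in plain Python. The `llm()` and `run_cypher()` functions here are stubbed placeholders (a real version would call OpenAI/Ollama and the Neo4j driver, e.g. via LangChain's GraphCypherQAChain); the schema and answers are made-up examples:

```python
# Hypothetical schema string, as an application might obtain it from the DB
SCHEMA = "(:Movie {title, released})<-[:ACTED_IN]-(:Person {name})"

def llm(prompt: str) -> str:
    # Stub standing in for a real LLM call (OpenAI, Ollama, ...)
    if "Generate a Cypher query" in prompt:
        return ("MATCH (p:Person)-[:ACTED_IN]->"
                "(m:Movie {title: 'The Matrix'}) RETURN p.name")
    return "Keanu Reeves acted in The Matrix."

def run_cypher(query: str) -> list:
    # Stub standing in for a Neo4j driver session
    return [{"p.name": "Keanu Reeves"}]

def answer(question: str) -> str:
    # 1. Send question + schema to the LLM; it returns a Cypher statement.
    cypher = llm(f"Schema: {SCHEMA}\nGenerate a Cypher query for: {question}")
    # 2. Execute the Cypher statement on the DB and get the result.
    rows = run_cypher(cypher)
    # 3. Pass the result to the LLM with instructions; it answers the user.
    return llm(f"Using these rows {rows}, answer the question: {question}")

print(answer("Who acted in The Matrix?"))
```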


Another diagram depicts how the specific data and the user question are converted to word embeddings.



Here is an example of creating a vector search index in Neo4j:

CALL db.index.vector.createNodeIndex(
    'moviePlots', // name of index
    'Movie', // node label
    'embedding', // property 
    1536, //dimension of embedding
    'cosine' // similarity function
)
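Once the index exists, it can be queried for nearest neighbours. A sketch using Neo4j's `db.index.vector.queryNodes` procedure, assuming `Movie` nodes have a `title` property and `$embedding` is bound to a 1,536-dimensional query vector:

```cypher
CALL db.index.vector.queryNodes(
    'moviePlots',   // name of index
    5,              // number of nearest neighbours to return
    $embedding      // query vector (same 1536 dimensions as the index)
)
YIELD node, score
RETURN node.title, score
```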

Reference: https://medium.com/neo4j/knowledge-graphs-llms-fine-tuning-vs-retrieval-augmented-generation-30e875d63a35

https://github.com/neo4j/NaLLM/tree/1af09cd117ba0777d81075c597a5081583568f9f

https://atoonk.medium.com/diving-into-ai-an-exploration-of-embeddings-and-vector-databases-a7611c4ec063

There are many visualization tools for the Neo4j database:
https://neo4j.com/developer/tools-graph-visualization/
https://medium.com/neo4j/tagged/data-visualization
https://neo4j.com/developer-blog/15-tools-for-visualizing-your-neo4j-graph-database/
https://neo4j.com/graph-visualization-neo4j/

New GenAI stack

https://collabnix.com/getting-started-with-genai-stack-powered-with-docker-langchain-neo4j-and-ollama/

https://github.com/docker/genai-stack

https://www.docker.com/press-release/neo4j-langchain-ollama-launches-new-genai-stack-for-developers/

https://neo4j.com/developer-blog/genai-app-how-to-build/

https://www.youtube.com/watch?v=fWUzSMzSAU0

Containers

1. langchain bot.py (Streamlit for UI), FastAPI, Svelte

It contains the application logic and data flows.

2. ollama sentence_transformer

3. neo4j

4. ollama llama2

Ollama manages local LLMs.

It seems llm is another tool, similar to Ollama:

https://datasette.io/tools/llm

Another GenAI stack, where (1) GPT-4 is used instead of Ollama + Llama 2 and (2) ChromaDB is used instead of Neo4j: https://medium.com/@rubentak/unleashing-the-power-of-intelligent-chatbots-with-gpt-4-and-vector-databases-a-step-by-step-8027e2ce9e78 and https://github.com/rubentak/Langchain/blob/main/notebooks/Langchain_doc_chroma.ipynb
