What is Chroma DB: A Step-By-Step Guide
Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 18 years of experience, and he has held prominent positions at IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
Are you ready to unlock the hidden language of vectors? To waltz with data in its most elegant form? If so, step into the Chromaverse, where the Chroma Database holds the key to a new reality of information retrieval.
Forget clunky keywords and rigid structures. Chroma speaks the fluent dialect of embeddings, those intricate tapestries woven from data points. Here, information isn't just stored, it shimmers with connections, waiting to be unearthed by the right query.
Imagine a vast library, not of words, but of meanings. Each book whispers its essence, not through letters, but through a symphony of vectors, dancing in high-dimensional space. Chroma is the librarian, the conductor, the translator. It guides you through this symphony, whispering suggestions, revealing hidden harmonies, and unearthing connections you never knew existed.
Whether you're a seasoned AI maestro, a curious data alchemist, or a novice explorer seeking the power of machine learning, Chroma welcomes you with open arms. Join us as we
So, are you ready to embrace the Chroma revolution? Come, don your data dancing shoes, and let's explore the possibilities together. In the Chromaverse, the only limit is your imagination.
Chroma isn't just a database, it's a translator for the AI world. Imagine complex data like text, images, and audio – Chroma converts it into numerical patterns called "embeddings" that computers can understand. These embeddings are like secret maps, revealing hidden connections and meaning within the data.
Vector Database: A specialized database designed to store and manage numerical vectors, also known as embeddings. These vectors capture the essence of information, like text, images, and audio, in a machine-readable format.
Embedding: A mathematical representation that maps complex data into a dense vector of numbers, preserving relationships and semantic meaning. Think of it as a unique fingerprint for each piece of information.
Embeddings, or vector embeddings, are a way of representing data (text, images, audio, video, and so on) in numerical form. To be precise, each piece of data is represented as a numerical vector, that is, a point in an n-dimensional space. This way, embeddings allow us to cluster similar data together. There are models that take these inputs and convert them into vectors. One example is Word2Vec, a popular embedding model developed by Google that converts words to vectors. Large Language Model providers typically offer their own embedding models, which create the embeddings used alongside their LLMs.
The good thing about converting words to vectors is that we can compare them. A computer cannot compare the meaning of two words as they are, but if we give them to it as numerical inputs, i.e. vector embeddings, it can. We can then form clusters of words with similar embeddings. The words King, Queen, Prince, and Princess will appear in one cluster because they are related to each other.
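To make the comparison concrete, here is a minimal sketch of one common way to compare two embeddings, cosine similarity. The three-dimensional vectors are invented purely for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any real model).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_banana = cosine_similarity(embeddings["king"], embeddings["banana"])

print(f"king vs queen:  {king_queen:.3f}")
print(f"king vs banana: {king_banana:.3f}")
```

A score close to 1 means the two vectors point in nearly the same direction, so related words score higher than unrelated ones.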
This way, embeddings allow us to find words similar to a given word. We can extend this to sentences, where we input a sentence and obtain the related sentences from the provided data. This is the basis for Semantic Search, Sentence Similarity, Anomaly Detection, chatbots, and many more use cases. The chatbots we build to perform Question Answering over a given PDF or Doc leverage this very concept of embeddings. Applications built on Generative Large Language Models commonly use this approach to fetch content related to the queries provided to them.
As discussed, embeddings are representations of any kind of data, usually unstructured data, in numerical form in an n-dimensional space. Now where do we store them? Traditional RDBMS (Relational Database Management Systems) are not designed to store and search these vector embeddings. This is where Vector Stores / Vector Databases come into play. There are many Vector Stores out there, which differ in the embedding models they support and the kind of search algorithm they use to find similar vectors.
Why do we need them? We need them because they provide fast access to the data we need. Let’s consider a chatbot based on a PDF. When a user enters a query, the first step is to fetch the content in the PDF related to that query and feed it to the chatbot, so that the chatbot can use this information to provide a relevant answer to the user. Now how do we get the content from the PDF that is relevant to the user query? The answer is a simple similarity search.
When data is represented as vector embeddings, we can find similarities between different parts of the data and extract the data similar to a particular embedding. The query is first converted to an embedding by an embedding model. The Vector Store then takes this embedding, performs a similarity search (through search algorithms) against the other embeddings stored in its database, and fetches the most relevant entries. The content associated with these embeddings is then passed to the Large Language Model, i.e. the chatbot, which uses this information to generate a final answer for the user.
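The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stored chunks, their three-dimensional vectors, and the query vector are all invented, and a real application would obtain the vectors from an embedding model and the search from a vector store.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical PDF chunks with made-up embeddings.
store = [
    ("Refunds are processed within 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.1, 0.9, 0.1]),
    ("Refund requests need an order number.", [0.8, 0.0, 0.3]),
]

question = "How do I get a refund?"
query_vec = [1.0, 0.0, 0.1]  # stand-in for the embedding of the question

# Step 1: similarity search. Step 2: hand the retrieved text to the LLM.
context = top_k(query_vec, store, k=2)
prompt = "Answer using this context:\n" + "\n".join(context) + f"\nQuestion: {question}"
print(prompt)
```

The resulting prompt is what would be sent to the chatbot model; the vector store's job ends once the relevant chunks are returned.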
Chroma is a Vector Store / Vector DB by the company Chroma. Chroma DB, like many other Vector Stores out there, is for storing and retrieving vector embeddings. The good part is that Chroma is a Free and Open Source project. This gives skilled developers around the world the opportunity to offer suggestions and make improvements to the database, and one can often expect a quick reply to an issue, since the whole Open Source community is able to see it and help resolve it.
At present, Chroma does not provide a hosting service, so the data is stored on the local file system when creating applications around Chroma, though Chroma is planning to build a hosting service in the near future. Chroma DB offers different ways to store vector embeddings: you can keep them in memory, you can save them to disk and load them back into memory, or you can run Chroma as a client that talks to a backend server. Overall, the Chroma DB API has only four core functions, which makes it short, simple, and easy to get started with.
In this section, we will install Chroma and see all the functionalities it provides. Firstly, we will install the library through the pip command: pip install chromadb
This will download the Chroma Vector Store API for Python. With this package, we can perform all tasks like storing the vector embeddings, retrieving them, and performing a semantic search for a given vector embedding.
We will start off by creating a persistent database. To create a client, we take the Client() object from Chroma DB and configure it with two settings: the database implementation and the directory in which to persist the data.
This will create an in-memory DuckDB database that is persisted in the Parquet file format, and we provide the directory where this data is to be stored. Here we are saving the database in the /content/ folder. Whenever we connect to a Chroma DB client with this configuration, Chroma DB will look for an existing database in the directory provided and load it. If it is not present, it will create it. And when we close the connection, the data will be saved to this directory.
Now, we will create a collection. A collection in a Vector Store is where we save a set of vector embeddings, documents, and any metadata if present. A collection in a vector database can be thought of as a table in a relational database.
We will now create a collection and add documents to it.
We start by creating a collection, which we name “my_information”.
To this collection, we add documents. In our case, we are just adding three sentences as three documents. The first document is about cars, the second one is about dogs, and the final one is about four-wheelers.
We also add metadata, providing one entry for each of the three documents.
Every document needs a unique ID, hence we give them id1, id2, and id3.
All of these are passed as arguments to the add() function of the collection.
After running the code, these documents are added to our collection “my_information”.
We learned that the information stored in Vector Databases is in the form of vector embeddings. But here, we provided text, i.e. documents. So how does it store them? By default, Chroma DB uses the all-MiniLM-L6-v2 embedding model to create the embeddings for us. This model takes our documents and converts them into vector embeddings. If we want to work with a specific embedding function, such as another sentence-transformer model from HuggingFace or an OpenAI embedding model, we can specify it through the embedding_function argument of the create_collection() method.
We can also provide embeddings directly to the Vector Store instead of passing documents to it. Just like the documents parameter of add(), there is an embeddings parameter, to which we pass the embeddings that we want to store in the Vector Database.
So now the model has successfully stored our three documents in the form of vector embeddings in the vector store. Now, we will look at retrieving relevant documents from them. We will pass a query and will fetch the documents that are relevant to it.
We do not always add all the information to the Vector Store at once. In most cases, we have only limited data/documents at the start, which we add as is to the Vector Store. Later, when we get more data, it becomes necessary to update the existing data/vector embeddings present in the Vector Store. To update data in Chroma DB, we do the following.
Previously, the document associated with id2 was about dogs. Now we are changing it to cats. For this information to be updated within the Vector Store, we pass the ID of the document, the updated document, and the updated metadata to the update() function of the collection. This updates id2, which was previously about dogs, to be about cats.
We then pass in Felines as the query to the Vector Store. Cats belong to the family of mammals called felines, so the collection should return the cat document as the relevant document. In the output, we see exactly that. The vector store was able to perform a semantic search between the query and the contents of the documents and return the right document for the query provided.
There is a function similar to update() called upsert(). The only difference between the two is what happens when the document ID does not exist in the collection. With update(), nothing is added: the call reports an error for that ID (depending on the Chroma version, the error is raised or only logged and the update is skipped). With upsert(), the document is added to the collection, just as with the add() function.
Sometimes, to reduce the space or remove unnecessary/ unwanted information, we might want to delete some documents from the collection in the Vector Store.
To delete an item from a collection, we have the delete() function. Here we delete the first document, associated with id1, which was about cars. To check, we query the collection with “car” as the query and look at the results. Only two documents, id3 and id2, appear: id3 is the document about four-wheelers, which is closest to cars, and id2 is the document about cats, which is far from cars, but since we specified n_results = 2 it is returned as well. In the release described here, calling delete() without any arguments deletes all the items in the collection; newer releases may reject such a call, so check the behavior of your installed version.
We have seen how to create a new collection and then add documents and embeddings to it. We have also seen how to extract the information relevant to a query from the collection, i.e. from the documents stored in the Vector Store. The collection object from Chroma DB also comes with many other useful functions.
Let us look at some other functionalities provided by Chroma DB.
The count() function of the collection returns the number of items present in it; for a collection holding three documents, the output is 3. The get() function returns all the items present in the collection, along with their ids, metadata, and embeddings if requested. In the output, we see that all the items we added to our collection are returned by the get() command. Let’s now look at modifying the collection name.
We use the modify() function of the collection to change the name that was given when the collection was created. When run, it changes the collection name from the old name defined at the start to the new name passed to modify() in the name argument. Now suppose we have multiple collections in our Vector Store. How do we work on a specific collection, that is, how do we get a specific collection from the Vector Store, and how do we delete a specific collection? Let’s see this.
The get_collection() function fetches an existing collection from the Vector Store, given its name. If the requested collection does not exist, the function raises an error. Here get_collection() will try to get the my_information_2 collection and assign it to the variable my_collection. To delete an existing collection, we have the delete_collection() function, which takes the collection name as the parameter (my_information in this case) and deletes it, if it exists.
In this guide, we have seen how to get started with Chroma, one of the Open Source Vector Databases. We started by learning what vector embeddings are, why they are necessary for Generative AI models, and how Vector Stores help these Generative Large Language Models. Then we took a deep dive into Chroma and saw how to create collections in it. Then we looked at how to add data such as documents to Chroma and how Chroma DB creates vector embeddings out of them. Finally, we saw how to retrieve the information relevant to a given query from a particular collection in the Vector Store.
Some of the key takeaways from this guide include: