Home / Blog / Generative AI / Introducing Rockset Vector Database

Introducing Rockset Vector Database

March 07, 2024
89

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumni with more than 18+ years of experience, he held prominent positions in the IT elites like HSBC, ITC Infotech, Infosys, and Deloitte. He is a prevalent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG with more than Ten years of experience and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Introduction

Rockset stands as a pioneering force in the realm of data analytics, driven by a commitment to revolutionize the way businesses harness and interpret their data resources. Renowned for its agility and innovation, Rockset provides a robust platform empowering users to seamlessly explore, analyze, and derive actionable insights from extensive datasets. Its reputation for introducing transformative features underscores its dedication to facilitating real-time analytics, consistently pushing the boundaries of what's achievable in data-driven decision-making.

At the forefront of innovation, Rockset is poised to introduce a groundbreaking addition: vector search. This forthcoming feature is poised to redefine the search landscape, promising rapid, precise searches that go beyond mere retrieval. Vector search is set to usher in a new era of data exploration, offering not just efficient search capabilities but also the potential to enhance personalized recommendations, fortify fraud detection systems, and introduce efficiency in navigating complex data landscapes. As Rockset gears up to unveil this game-changing feature, the industry eagerly anticipates the new possibilities it will unlock in data exploration and analysis

Significance of vector search

Enterprises constantly amass a wealth of diverse data types, spanning textual documents, multimedia files, and information generated by machines and sensors. Astonishingly, a staggering 80% of all amassed data is classified as unstructured. Despite this abundance, organizations struggle to harness its potential, leveraging only a fraction of this resource to glean valuable insights, steer strategic decision-making, and craft immersive user experiences. The inherent challenge lies in the complexity of making sense of this unstructured data, a process that demands not just technical proficiency but also deep domain-specific knowledge. As a consequence of these complexities, a significant portion of this reservoir of unstructured data remains largely untapped, underscoring the untapped potential that lies dormant within organizations.

With the advancement in technologies like machine learning, neural networks and LLM’s(Large Language Models), organizations can now turn unstructured data into something called embeddings, which are like organized labels represented as vectors. Vector search uses these labels to spot patterns and measure how similar different parts of the unstructured data are to each other.

Before the advent of vector search, search experiences were primarily reliant on keyword searches, which often required manually tagging data to identify and provide relevant results. The process of manually tagging documents involved numerous steps, such as

creating taxonomies,
comprehending search patterns,
analyzing input documents, and
maintaining customized rule sets.

For instance, if tagged keywords were sought to deliver product results, the manual tagging of terms like "Minecraft" as a "sandbox game" and "exploration game" was necessary. Additionally, identifying and tagging phrases resembling "sandbox game," such as "open-world adventure" and "creative exploration," was crucial to ensure the delivery of pertinent search outcomes.

In more recent times, keyword search has begun depending on term proximity, leveraging tokenization. Tokenization works by breaking down titles, descriptions, and documents into individual words or parts of words. Then, term proximity functions use matches between these individual words and search terms to provide results. While tokenization lessens the need for manually tagging and handling search criteria, keyword search still falls short in delivering semantically similar results. This limitation is particularly noticeable in natural language, which relies on connections between words and phrases.

Vector search enables us to utilize text embeddings, capturing meaningful connections among words, phrases, and sentences to enhance search experiences significantly. For instance, we could employ vector search to identify games featuring " block building, survival challenges, and cooperative gameplay." Rather than manually tagging each game according to these specific criteria or breaking down every game description for precise matches, vector search automates this process. This automation ensures the delivery of more relevant results, streamlining the search process significantly.

The Role of Embeddings in Vector Search

Embeddings, portrayed as number arrays or vectors, encapsulate the essence of unstructured data—text, audio, images, and videos. They present this information in a format that computational models can readily comprehend and manipulate.

In exploring these relationships, embeddings can be utilized to discern the associations between terms such as “Minecraft,” “Roblox,” and “Stardew Valley.” These terms are given meaning by generating embeddings, which coalesce when mapped onto a multi-dimensional space. In this spatial context, each term is assigned specific coordinates (x, y), allowing comprehension of their similarities by gauging the distances and angles between them.

Real-world scenarios often involve unstructured data encompassing billions of data points that translate into embeddings with thousands of dimensions. Vector search examines these embeddings, identifying terms situated in close proximity, such as “Minecraft” and “Roblox,” while also capturing closer associations, including synonyms like “Stardew Valley” and analogous terms within the dataset.

The surge in popularity of vector search can be attributed to enhancements in accuracy and increased accessibility to models responsible for generating embeddings. Models like BERT have revolutionized natural language processing, producing embeddings spanning thousands of dimensions. For instance, OpenAI’s text embedding model, text-embedding-ada-002, crafts embeddings comprising 1,526 dimensions, presenting a comprehensive portrayal of the inherent language structure.

Rockset for fast and efficient search

Once armed with embeddings for unstructured data, Rockset harnesses vector search to pinpoint similarities among these embeddings. With a suite of distance functions like dot product, cosine similarity, and Euclidean distance, Rockset computes the likeness between embeddings and search inputs, backing up K-Nearest Neighbors (KNN) search to retrieve the most comparable embeddings.

Rockset now hosts vector search capabilities, capitalizing on recently released vector operations and distance functions. It aligns with other vector databases like Milvus, Pinecone, and Weaviate, as well as alternatives like Elasticsearch, in indexing and storing vectors. Rockset employs its Converged Index technology, tailored for metadata filtering, vector search, and keyword search, facilitating sub-second search speeds, aggregations, and scalable joins.

Rockset brings forth various advantages with vector search support for tailored experiences

Real-Time Data: Swiftly ingest and index incoming data, accommodating updates in real-time.
Feature Generation: Aggregate and transform data on-the-fly during ingestion, minimizing data storage requirements while generating intricate features.
Expedited Search: Integrate vector search with selective metadata filtering for swift, effective results.
Hybrid Search Plus Analytics: Fuse other datasets with vector search outcomes, enabling enriched experiences via SQL.
Fully-Managed Cloud Service: Execute these operations on a scalable, cloud-native database, optimizing cost-efficiency through segregated computing and storage layers.

Recommendations using OpenAI and Rockset

To find relevant product on Amazon, one can leverage upon OpenAI and Rockset to perform semantic search.

In this demonstration, publicly available product data from Amazon will be utilized, encompassing product listings and reviews.

Generate Embeddings

The initial step in this guide involves employing OpenAI’s text embeddings API to create embeddings for Amazon product descriptions. The text-embedding-ada-002 model was chosen due to its performance, accessibility, and compact embedding size, although several models from HuggingFace, runnable locally, were also taken under consideration.

The OpenAI model produces a 1,536-element embedding. Throughout this guide, embeddings will be generated and preserved for 8,592 video game product descriptions from Amazon. Additionally, an embedding will be created for the demonstration's search query: “space and adventure, open-world play, and multiplayer options.”

The following code can be used to generate the embeddings

Upload Embeddings to Rockset

The subsequent step involves uploading these embeddings and the product data to Rockset, initiating the creation of a new collection to commence vector search. The process unfolds as follows

A collection is created in Rockset by uploading the file containing video game product listings and their corresponding embeddings. Alternatively, data could have been extracted seamlessly from other storage sources such as Amazon S3, Snowflake, or streaming services like Kafka and Amazon Kinesis, utilizing Rockset’s built-in connectors. Subsequently, Ingest Transformations can be applied, utilizing SQL to modify the data during the ingestion process. This transformation process includes the utilization of Rockset’s newly introduced VECTOR_ENFORCE function, validating the length and elements of incoming arrays. This validation ensures alignment between vectors, facilitating seamless query execution.

Run Vector Search on Rockset

Next step is to execute vector search on Rockset utilizing the newly introduced distance functions. The COSINE_SIM function receives the description embeddings field as one argument and the search query embedding as another, seamlessly orchestrated within comprehensive SQL commands provided by Rockset.

In this demonstration, the search query embedding was manually copy-pasted into the COSINE_SIM function within the SELECT statement. Alternatively, real-time generation of the embedding could have been achieved by directly invoking the OpenAI Text Embedding API and conveying the embedding to Rockset as a Query Lambda parameter.

Leveraging Rockset’s Converged Index, KNN search queries exhibit exceptional performance, especially when coupled with selective metadata filtering. Rockset strategically applies these filters before computing similarity scores, streamlining the search process by solely evaluating scores for pertinent documents. In this specific vector search query, filters can be implemented based on price and game developer criteria, ensuring that results fall within designated price ranges and are compatible with a specified gaming device.

In just 15 milliseconds, Rockset swiftly filters through over 8,500 documents, employing brand and price filters before computing similarity scores. This lightning-fast process occurs on a Large Virtual Instance equipped with 16 vCPUs and 128 GiB of allocated memory. Here are snippets from the top three results based on the search query "space and adventure, open-world play, and multiplayer options"

An immersive role-playing adventure designed for 1 to 4 players, offering a deep dive into a fantastical world and marking the inception of a new series.
Join the journey of a stranded spaceman on a mysterious planet, urgently scavenging for spacecraft components within a limited timeframe.
Engage in fast-paced multiplayer action with modes catering to up to four players, including Deathmatch, Cop Mode, and Tag.

In summary, Rockset efficiently conducts semantic searches in approximately 15 milliseconds, utilizing OpenAI-generated embeddings. This process combines vector search techniques with metadata filtering, delivering quicker and more pertinent results.

Significance for Search: Impact and Implications

In the demonstration above it has been illustrated how vector search empowers semantic search, and there are numerous scenarios where rapid and relevant search proves beneficial

Personalization & Recommendation Engines: Apply vector search in e-commerce sites and consumer apps to gauge interests based on past purchases and page views. It aids in generating personalized product recommendations by recognizing similarities between users.
Anomaly Detection: Utilize vector search to spot unusual transactions by comparing their similarities (or differences) to legitimate past transactions. Embeddings based on attributes like transaction amount, location, and time enhance anomaly identification.
Predictive Maintenance: Deploy vector search in analyzing factors like engine temperature, oil pressure, and brake wear in vehicle fleets. By comparing readings to healthy truck references, it identifies potential issues like engine malfunctions or worn-out brakes.

The use of unstructured data is projected to soar with increased accessibility to large language models and declining costs for generating embeddings. Rockset plays a vital role in expediting the convergence of real-time machine learning and analytics. Its fully-managed, cloud-native service simplifies the adoption of vector search.

Now, search functionalities are more accessible than ever. There's no need for intricate rule-based algorithms or manual configuration of text tokenizers and analyzers. There are endless possibilities.