Vector Databases: Demystifying the Term and Exploring the Excitement
Key trade-offs in choosing a vector database solution.
Vector databases have garnered significant attention recently, with over 10 companies offering vector database solutions. The number of options prompts several questions: What is a vector database? Why are there so many choices? Should you migrate your existing database to one? To address these, let's begin with the concept of data.
A historical perspective on databases
Data stored digitally in computers can be structured or semi-structured. Typically, it is managed in a specialized system called a database for efficient access. Vectors, on the other hand, are a specific data format: fixed-length arrays of numbers, often compact and semantically rich, that can represent many types of content, from text documents to audio files. Vector databases are designed to handle vectors at scale, relying on semantic similarity to improve query results compared to traditional keyword-based queries.
SQL databases, originating in the 1970s, are among the most mature database types and are widely known for their structured approach to data. They excel at managing transactional data, which typically arrives as sequential records and fits naturally into structured tables. Relational databases model real-world complexity by linking tables to one another. However, their rigid schemas can be limiting when dealing with diverse and rapidly accumulating data from various sources, especially in the era of big data.
This is where NoSQL databases come into play. They offer flexibility by adopting a schema-less approach, often storing data as semi-structured documents such as JSON. This approach makes horizontal scaling easier, because data can be distributed across multiple machines.
Transformers and their integration with databases
Vector databases represent a natural evolution or extension of NoSQL databases. Prior to vector databases, database search involved declarative queries in SQL databases or JSON-based queries in NoSQL databases. Full-text search emerged as a way to extract information from vast datasets. In its early stages, full-text search relied on how often a term appears within a document relative to how often it appears across the entire dataset (the idea behind TF-IDF). Additionally, inverted index algorithms factored in keywords, subgroups, and other attributes for querying, akin to a bag-of-words approach in NLP.
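To make the keyword-based approach concrete, here is a minimal sketch of an inverted index with TF-IDF-style scoring. The corpus, the function names, and the exact weighting (raw term count times the log of inverse document frequency) are illustrative assumptions, not a description of any particular search engine.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the documents are made up for illustration.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "vector databases store embeddings",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(query, index, n_docs):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term never seen: it contributes nothing
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_inverted_index(DOCS)
results = search("dog cat", index, len(DOCS))
```

Note that a query such as "kitten" returns nothing here, even though one might expect documents about cats. Matching on exact terms is the limitation that semantic search tries to address.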
More recently, transformers have demonstrated exceptional prowess in encoding semantics. When working with documents, transformers excel at extracting meaningful terms for classification and retrieval, often surpassing earlier NLP techniques. This led to the development of vector databases, which aim to combine the strengths of transformers and databases to create a semantic search engine within the database.
In a vector database, the 'vector' component typically involves a transformer-based language model that represents sentences as vectors. When a query is submitted, the database encodes it into a vector in the same way and computes the similarity between the query vector and the stored vectors.
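The similarity step can be sketched as follows, assuming cosine similarity as the measure (a common choice, though products may use others). The three-dimensional vectors are hand-written stand-ins for real embeddings, which would come from a language model and have hundreds of dimensions. The names STORED and top_k are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the numbers are invented for illustration.
STORED = {
    "how to reset a password": [0.9, 0.1, 0.0],
    "quarterly revenue report": [0.0, 0.2, 0.9],
    "recovering a locked account": [0.8, 0.3, 0.1],
}

def top_k(query_vec, stored, k=2):
    """Brute-force search: score every stored vector, keep the k best."""
    scored = [(text, cosine_similarity(query_vec, vec))
              for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Pretend this is the embedding of the query "I forgot my login".
query_vec = [1.0, 0.2, 0.0]
matches = top_k(query_vec, STORED)
```

The query shares no keywords with the two account-related entries, yet they rank highest because their vectors point in a similar direction. Scanning every stored vector like this is exact but slow at scale, which is why dedicated index structures matter.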
Factors to weigh when selecting your vector database solution
When paired with large language models, vector databases offer intriguing applications, including natural language querying. However, there are trade-offs to weigh when building or selecting a vector database. Typically, the main motive for exploring vector databases is either to enhance semantic search capabilities or to add semantic retrieval alongside an existing database, such as Postgres. You might wonder: why not use the existing database's own vector index directly? The commonly cited challenge is that a vector index tied to a general-purpose database's internals may miss optimization opportunities. Such a solution isn't purpose-built for fast indexing, fast querying, and other critical functions. If the aim is to create a robust vector search system, it's often more sensible to consider a purpose-built solution.
The second significant trade-off in selecting a vector database solution is the choice between a built-in embedding pipeline and a custom one. A built-in embedding pipeline offers convenience for beginners, while some prefer open-source platforms like Hugging Face, which host sentence transformer models. With this approach, data is passed through the pipeline to generate sentence embeddings. Interestingly, certain database vendors offer APIs for such models, allowing you to choose your transformer pipeline without writing custom code up front.
Another trade-off to consider in vector databases is the balance between indexing and querying. Indexing involves encoding data into vectors and storing them in efficient data structures. The challenge lies in searching through these vectors efficiently, so indexing aims to organize the vectors in a way that allows fast and scalable querying. This is an upfront, upstream process. In contrast, querying is a downstream process in which user input is transformed into a vector and compared with the vectors in the database.
Existing vector database solutions often specialize in either indexing or querying, but not both. Some excel at indexing but are less efficient at querying, while others prioritize querying but may be slower in indexing.
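The split between upfront indexing and downstream querying can be illustrated with a toy approximate index. This sketch uses random-hyperplane hashing, one of several approximate-nearest-neighbor techniques, chosen here only because it is short. It is not a claim about which structure any particular product uses. The class name HyperplaneIndex and all parameters are hypothetical.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class HyperplaneIndex:
    """Toy approximate index: vectors on the same side of every random
    hyperplane share a bucket, and a query only scans its own bucket."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        # One boolean per hyperplane: which side does the vector fall on?
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def add(self, item_id, vec):
        """Indexing step: pay the hashing cost once, at write time."""
        self.buckets[self._signature(vec)].append((item_id, vec))

    def query(self, vec, k=1):
        """Query step: rank only the candidates in the matching bucket."""
        candidates = self.buckets.get(self._signature(vec), [])
        ranked = sorted(candidates, key=lambda item: cosine(vec, item[1]),
                        reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

# Indexing (upfront): hash 100 made-up 8-dimensional vectors into buckets.
rng = random.Random(42)
vectors = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(100)}
index = HyperplaneIndex(dim=8)
for item_id, vec in vectors.items():
    index.add(item_id, vec)

# Querying (downstream): hash the query, then rank only its bucket.
nearest = index.query(vectors[7], k=1)
```

The design choice shows the tension in miniature: more hyperplanes mean smaller buckets and faster queries, but a higher chance that a true nearest neighbor lands in a different bucket and is missed.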
The Prospects of Vector Databases: Promising Opportunities and Ease of Use
Despite the trade-offs and complexities associated with vector databases, they hold great promise. Just a few years ago, the default way to search was the Google search bar. Now, Large Language Models (LLMs) are opening up possibilities for creating scalable and reliable in-house search engines over proprietary data. Vector databases are well positioned to add value in factual knowledge retrieval. Language models can draw on the relationships they have implicitly encoded, in ways you might not have considered, to uncover new insights within your data.
Another common application of vector databases is retrieval-augmented generation. Instead of simply returning documents for a query, you add a language model to the query process. The relevant parts of the documents are retrieved first, and the model then uses them, together with your query, to generate a response that potentially answers your question.
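A retrieval-augmented generation flow can be sketched in a few lines. Everything here is an illustrative assumption: the embed function is a word-count stand-in for a real transformer model, the passages are invented, and the final call to a language model is left out, so the sketch stops once the prompt is assembled.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: word counts. A real system would call a
    transformer model here instead."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented passages standing in for proprietary documents.
PASSAGES = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders above 50 euros.",
]

def retrieve(question, passages, k=1):
    """Retrieval step: return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: similarity(q, embed(p)),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Augmentation step: put the retrieved text in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

question = "When are refunds issued"
prompt = build_prompt(question, retrieve(question, PASSAGES))
# Generation step (not shown): send `prompt` to a language model of your choice.
```

Only the retrieved passage reaches the model, which keeps the prompt short and grounds the answer in your own data rather than in whatever the model memorized during training.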