Vector Embeddings and RAG Demystified: Leveraging Amazon Bedrock, Aurora, and LangChain - Part 1
Revolutionize big data handling and machine learning applications.
The Versatility of Embeddings in Applications
Creating Embeddings Using Boto3
Creating Embeddings Using LangChain
E-commerce Product Recommendation
Delving Deeper Into Vector Storage
Enhanced Locality-Sensitive Hashing (LSH)
Hierarchical Navigable Small World (HNSW)
Advanced Graph-Based Similarity Search Techniques
Along the way, we will use LangChain, a tool that enhances our journey into the practical application of these concepts and demonstrates how to seamlessly integrate these technologies into real-world AI solutions.

Consider the terms 'coffee' and 'tea'. In a hypothetical vocabulary space, these two could be transformed into numerical vectors. If we visualize this in a 3-dimensional vector space, 'coffee' might be represented as `[1.2, -0.9, 0.3]` and 'tea' as `[1.0, -0.8, 0.5]`. Such numerical vectors carry semantic information, indicating that 'coffee' and 'tea' are conceptually similar to each other due to their association with hot beverages, and they would likely be positioned closer together in the vector space than either would be to unrelated concepts like 'astronomy' or 'philosophy'.
An early approach to representing text was to simply count word occurrences, for example with scikit-learn's `CountVectorizer` tool. For a while, this was the standard, until the introduction of word2vec.

For images, embeddings are typically produced by Convolutional Neural Networks (CNNs), which scan pictures with small matrices known as `filters` or `kernels`. A CNN operates by examining small portions of an image and recognizing various features like lines, colors, and shapes. It then progressively combines these features to understand more complex structures within the image, such as objects or faces. When applied to a new image, CNNs are capable of generating detailed and insightful vector representations. These representations are not just pixel-level data but a more profound understanding of the image's content, making CNNs invaluable in areas like facial recognition, medical imaging, and autonomous vehicles. Their ability to learn from vast amounts of data and identify intricate patterns makes them a cornerstone of modern image processing techniques.
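To make this concrete, here is a minimal sketch of a toy CNN encoder in PyTorch that turns an image into an embedding vector; the layer sizes and the 128-dimensional output are illustrative assumptions, not a production architecture.

```python
import torch
import torch.nn as nn

# A toy CNN encoder that turns a 3x64x64 image into a 128-dimensional embedding.
class TinyImageEncoder(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters/kernels scan small patches
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper filters combine simple features
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # collapse the spatial dimensions
        )
        self.projection = nn.Linear(32, embedding_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.features(image).flatten(1)               # shape: (batch, 32)
        return self.projection(x)                         # shape: (batch, embedding_dim)

encoder = TinyImageEncoder()
fake_image = torch.randn(1, 3, 64, 64)                    # one random RGB image
print(encoder(fake_image).shape)                          # torch.Size([1, 128])
```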
For both textual and visual data, the trend has since shifted towards transformer-based models. Transformers operate on sequences of tokens, whether words in text or pixels in images. Equipped with a large number of parameters, these models excel in identifying complex patterns and relationships through training on comprehensive datasets.

The resulting embeddings are versatile: the same vectors can power search, clustering, or classification tasks, for example detecting 'positive' or 'negative' sentiment indicators. This flexibility makes embeddings a fundamental component in many data-driven applications.

There are different distance metrics used in vector similarity calculations, such as:
- Euclidean distance: measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, where 0 represents identical vectors and larger values represent increasingly dissimilar vectors.
- Cosine similarity: calculates the cosine of the angle between two vectors in a vector space. It ranges from -1 to 1, where 1 represents vectors pointing in the same direction, 0 represents orthogonal vectors, and -1 represents vectors that are diametrically opposed.
- Dot product: reflects the product of the magnitudes of two vectors and the cosine of the angle between them. Its range extends from -∞ to ∞, with a positive value indicating vectors that point in the same direction, 0 indicating orthogonal vectors, and a negative value indicating vectors that point in opposite directions.
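As a quick illustration, the sketch below computes all three metrics for the hypothetical 'coffee' and 'tea' vectors introduced earlier (the vectors are toy values, not real embeddings):

```python
import numpy as np

# Toy vectors from the 'coffee'/'tea' example above (not real embeddings)
coffee = np.array([1.2, -0.9, 0.3])
tea = np.array([1.0, -0.8, 0.5])

# Euclidean distance: straight-line distance, 0 means identical
euclidean = np.linalg.norm(coffee - tea)

# Cosine similarity: cosine of the angle between the vectors, 1 means same direction
cosine = np.dot(coffee, tea) / (np.linalg.norm(coffee) * np.linalg.norm(tea))

# Dot product: unnormalized measure of alignment
dot = np.dot(coffee, tea)

print(f"euclidean={euclidean:.3f}, cosine={cosine:.3f}, dot={dot:.3f}")
```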
We will use Amazon Aurora with the `pgvector` extension for similarity searches. First, let's generate embeddings with the Boto3 `bedrock` client; later, we will see how we can do the same using LangChain. The code:

- Initializes a session with AWS using Boto3 and creates a client for the `bedrock-runtime` service.
- Defines a function `get_embedding`, which accepts a text input, then utilizes the Amazon Titan Embeddings model to transform this text into an embedding. Once the embedding is generated, the function returns the embedding vector.
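A minimal sketch of those steps follows; the region and the Titan model ID (`amazon.titan-embed-text-v1`) are assumptions you may need to adjust for your account:

```python
import json
import boto3

# Initialize a session with AWS and create a client for the bedrock-runtime service.
# The region is an assumption; use one where Amazon Bedrock is enabled for you.
session = boto3.Session()
bedrock_runtime = session.client("bedrock-runtime", region_name="us-east-1")

def get_embedding(text: str) -> list:
    """Transform `text` into an embedding using the Amazon Titan Embeddings model."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",   # Amazon Titan Embeddings G1 - Text
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

vector = get_embedding("Coffee and tea are popular hot beverages.")
print(len(vector))  # dimensionality of the returned embedding
```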
Next, we generate the same embeddings with LangChain, using the `embed_query()` method from the `BedrockEmbeddings` class. The code:

- Imports the `BedrockEmbeddings` class from `langchain`.
- Creates an instance of `BedrockEmbeddings` to generate embeddings.
- Appends embeddings of several sentences to a `list`.
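A minimal sketch of that flow, assuming the same Titan model ID and that AWS credentials and a region are already configured in your environment (the import path for `BedrockEmbeddings` varies across LangChain versions):

```python
from langchain.embeddings import BedrockEmbeddings  # newer releases: langchain_community.embeddings

# Create an instance of BedrockEmbeddings; the model ID is an assumption.
embeddings_client = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

sentences = [
    "I love making espresso in the morning.",
    "Green tea is a soothing hot beverage.",
    "The telescope captured images of a distant galaxy.",
]

# Embed each sentence with embed_query() and append the vectors to a list.
sentence_embeddings = []
for sentence in sentences:
    sentence_embeddings.append(embeddings_client.embed_query(sentence))

# embed_documents() produces the same vectors for a whole batch in one call.
batch_embeddings = embeddings_client.embed_documents(sentences)

print(len(sentence_embeddings), len(sentence_embeddings[0]))
```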
Embeddings for a batch of texts can be generated in a single call using the `embed_documents()` method as well. Notice that the embedding values are identical whether they are generated with LangChain or boto3. This uniformity is attributed to the underlying model in use, which is Amazon Titan Embeddings G1 - Text.
Knowing how to create a vector is crucial, but equally important is knowing how to store these vectors. Before diving into storage methods, let's briefly touch on vector search, which underscores the need for storing embeddings.

In a vector search, each data item is represented as a vector in a high-dimensional space, capturing the data's features or characteristics. The aim is to identify the vectors most similar to a given query vector. We've seen how these vector embeddings, numerical arrays representing coordinates in a high-dimensional space, are crucial in measuring distances using metrics like cosine similarity or Euclidean distance, which we discussed earlier.

Imagine an e-commerce platform where each product has a vector representing its features, like color, size, category, and user ratings. When a user searches for a product, the search query is converted into a vector. The system then performs a vector search to find products with similar feature vectors, suggesting these as recommendations.

This process requires efficient vector storage. A vector storage mechanism is essential for storing and retrieving vector embeddings. While standalone solutions exist for this, vector databases like Amazon Aurora (with `pgvector`), Amazon OpenSearch, and Amazon Kendra offer more integrated functionalities. They not only store but also manage large sets of vectors, using indexing mechanisms for efficient similarity searches. We will dive into vector stores/databases in the next section.
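The sketch below illustrates that recommendation flow in miniature, ranking a handful of made-up product vectors by cosine similarity to a query vector (the product names and feature values are purely illustrative):

```python
import numpy as np

# Hypothetical product feature vectors (toy values for illustration only)
products = {
    "red running shoes": np.array([0.9, 0.1, 0.8, 0.4]),
    "blue trail shoes":  np.array([0.8, 0.2, 0.7, 0.5]),
    "ceramic teapot":    np.array([0.1, 0.9, 0.2, 0.3]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector for the user's search query, e.g. "lightweight running shoes"
query = np.array([0.85, 0.15, 0.75, 0.45])

# Rank products by similarity to the query and recommend the closest ones
ranked = sorted(products, key=lambda name: cosine_similarity(query, products[name]), reverse=True)
print(ranked[:2])
```

Efficient vector storage and retrieval typically involves a few building blocks: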
- Indexing: This is about organizing vectors to speed up retrieval. Techniques like k-d trees or Annoy are employed for this.
- Vector libraries: These offer functions for operations like dot product and vector indexing.
- Vector databases: They are specifically designed for storing, managing, and retrieving vast sets of vectors. Examples include Amazon Aurora (with `pgvector`), Amazon OpenSearch, and Amazon Kendra, which utilize indexing for efficient searches; a small `pgvector` usage sketch follows this list.
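To give a feel for how a relational vector store is queried, here is a minimal sketch of using the `pgvector` extension on a PostgreSQL-compatible database such as Aurora via `psycopg2`; the connection settings, table name, and 3-dimensional vectors are all assumptions for illustration:

```python
import psycopg2

# Connection details are placeholders; point these at your Aurora PostgreSQL cluster.
conn = psycopg2.connect(host="my-aurora-endpoint", dbname="mydb", user="myuser", password="mypassword")
cur = conn.cursor()

# Enable pgvector and create a table with a 3-dimensional vector column (toy size).
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, name text, embedding vector(3));")

# Store a couple of embeddings.
cur.execute(
    "INSERT INTO items (name, embedding) VALUES (%s, %s), (%s, %s);",
    ("coffee", "[1.2, -0.9, 0.3]", "tea", "[1.0, -0.8, 0.5]"),
)

# Retrieve the item closest to a query vector; <=> is pgvector's cosine distance operator.
cur.execute("SELECT name FROM items ORDER BY embedding <=> %s::vector LIMIT 1;", ("[1.1, -0.85, 0.4]",))
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()
```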
K-Nearest Neighbor (KNN) is a straightforward algorithm used for classification and regression tasks. In KNN, the class or value of a data point is determined by its k nearest neighbors in the training dataset.
- Selecting k: Decide the number of nearest neighbors (k) to influence the classification or regression.
- Distance Calculation: Measure the distance between the point to classify and every point in the training dataset.
- Identifying Nearest Neighbors: Choose the k closest data points.
- Classifying or Regressing:
- For classification: Assign the class based on the most frequent class within the k neighbors.
- For regression: Use the average value from the k neighbors as the prediction.
- Making Predictions: The algorithm assigns a predicted class or value to the new data point.
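The steps above can be captured in a few lines of NumPy; this is a brute-force sketch of the classification case with toy data, not an optimized implementation:

```python
import numpy as np
from collections import Counter

def knn_classify(query: np.ndarray, points: np.ndarray, labels: list, k: int = 3) -> str:
    # Distance calculation: Euclidean distance from the query to every training point
    distances = np.linalg.norm(points - query, axis=1)
    # Identifying nearest neighbors: pick the k closest points
    nearest = np.argsort(distances)[:k]
    # Classifying: majority vote among the k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D training data with two classes
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
labels = ["hot drink", "hot drink", "cold drink", "cold drink"]

print(knn_classify(np.array([0.15, 0.18]), points, labels, k=3))  # -> "hot drink"
```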
Because KNN compares the query against every stored vector, its search cost grows as `O(nd)`, where `n` is the number of vectors and `d` is the vector dimension. This scalability issue is addressed with Approximate Nearest Neighbor (ANN) algorithms for faster search, such as:

- Product Quantization (PQ)
- Locality-Sensitive Hashing (LSH)
- Hierarchical Navigable Small World (HNSW)
- Vector Breakdown: The first step in PQ involves breaking down each high-dimensional vector into smaller sub-vectors. By dividing the vector into segments, PQ can manage each piece individually, simplifying the subsequent clustering process.
- Cluster Formation via K-means: Each sub-vector is then processed through a k-means clustering algorithm. This is like finding representative landmarks for different neighborhoods within a city, where each landmark stands for a group of nearby locations. We can see multiple clusters formed from the sub-vectors, each with its centroid. These centroids are the key players in PQ; instead of indexing every individual vector, PQ only stores the centroids, significantly reducing memory requirements.
- Centroid Indexing: In PQ, we don't store the full detail of every vector; instead, we index the centroids of the clusters they belong to, as demonstrated in the first image. By doing this, we achieve data compression. For example, if we use two clusters per partition and have six vectors, we achieve a 3X compression rate. This compression becomes more significant with larger datasets.
- Nearest Neighbor Search: When a query vector comes in, PQ doesn't compare it against all vectors in the database. Instead, it only needs to measure the squared Euclidean distance from the centroids of each cluster. It's a quicker process because we're only comparing the query vector to a handful of centroids rather than the entire dataset.
- Balance Between Accuracy and Efficiency: The trade-off here is between the granularity of the clustering (how many clusters are used) and the speed of retrieval. More clusters mean finer granularity and potentially more accurate results but require more time to search through.
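A compact sketch of these steps, using scikit-learn's KMeans for the per-segment clustering; the segment count, cluster count, and random data are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))    # 1000 toy vectors of dimension 8
n_segments, n_clusters = 4, 16          # split each vector into 4 sub-vectors, 16 centroids each
segment_dim = vectors.shape[1] // n_segments

codebooks, codes = [], []
for s in range(n_segments):
    # Vector breakdown: take the s-th segment of every vector
    segment = vectors[:, s * segment_dim:(s + 1) * segment_dim]
    # Cluster formation via k-means: learn centroids for this segment
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(segment)
    codebooks.append(km.cluster_centers_)
    # Centroid indexing: store only the centroid ID for each sub-vector
    codes.append(km.labels_)
codes = np.stack(codes, axis=1)          # shape (1000, 4): four small integers per vector

# Nearest neighbor search: compare the query only against centroids, segment by segment
query = rng.normal(size=8)
distance_tables = [
    np.linalg.norm(codebooks[s] - query[s * segment_dim:(s + 1) * segment_dim], axis=1) ** 2
    for s in range(n_segments)
]
approx_distances = sum(distance_tables[s][codes[:, s]] for s in range(n_segments))
print("approximate nearest neighbor:", int(np.argmin(approx_distances)))
```

Locality-Sensitive Hashing (LSH) takes a different route, hashing similar vectors into the same buckets: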
- Dimensionality Reduction: Initially, vectors are projected onto a lower-dimensional space using a random matrix. This step simplifies the data, making it more manageable and reducing the computational load for subsequent operations.
- Binary Hashing: After dimensionality reduction, each component of the projected vector is 'binarized', typically by assigning a 1 if the component is positive and a 0 if negative. This binary hash code represents the original vector in a much simpler form.
- Bucket Assignment: Vectors that share the same binary hash code are assigned to the same bucket. By doing so, LSH groups vectors that are likely to be similar into the same 'bin', allowing for quicker retrieval based on hash codes.
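Here is a small random-projection LSH sketch built only on NumPy; the projection size and data are assumptions chosen for brevity:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 64))        # toy dataset of 64-dimensional vectors

# Dimensionality reduction: a random projection matrix from 64 down to 8 components
projection = rng.normal(size=(64, 8))

def lsh_hash(v: np.ndarray) -> tuple:
    projected = v @ projection
    # Binary hashing: 1 where the component is positive, 0 otherwise
    return tuple((projected > 0).astype(int))

# Bucket assignment: vectors sharing a hash code land in the same bucket
buckets = defaultdict(list)
for i, v in enumerate(vectors):
    buckets[lsh_hash(v)].append(i)

# At query time, only the vectors in the query's bucket are candidates
query = rng.normal(size=64)
candidates = buckets[lsh_hash(query)]
print(f"{len(candidates)} candidate vectors out of {len(vectors)}")
```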
Hierarchical Navigable Small World (HNSW) builds a graph in which each vector is a node connected to its close neighbors, like locations linked by bridges. This is where the hierarchical part kicks in: we create multiple layers of graphs, each with different bridge lengths. The top layer has the longest bridges, while the bottom layer has the shortest. A search starts on the sparse top layer, taking long hops toward the target region, and then drops down through the denser layers to refine the result.

In addition to HNSW and KNN, there are other ways to find similar items or patterns using graphs, such as Graph Neural Networks (GNNs) and Graph Convolutional Networks (GCNs). These methods use the connections and relationships in graphs to search for similarities. There is also the Annoy (Approximate Nearest Neighbors Oh Yeah) algorithm, which sorts vectors using a tree structure made of random divisions, somewhat like sorting books on shelves by category. Annoy is user-friendly and good for quickly finding items that are almost, but not exactly, the same.

When choosing one of these methods, it's important to consider how fast the search needs to be, how precise the results should be, and how much memory you can use. The right choice depends on the requirements of the specific task and the type of data you're working with.
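In practice you would rarely build HNSW yourself; libraries such as hnswlib (listed below) implement it. A minimal usage sketch with random toy data:

```python
import numpy as np
import hnswlib

dim, num_elements = 32, 10_000
data = np.random.rand(num_elements, dim).astype("float32")

# Build an HNSW index over the vectors (M and ef_construction control graph density and quality)
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef trades accuracy for speed at query time
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=5)
print(labels)
```

Several open-source libraries implement these approximate search techniques: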
- FAISS (Facebook AI Similarity Search): Developed by Meta (formerly Facebook), this library finds and clusters similar dense vectors (vectors in which most components carry values, as opposed to sparse ones). It's great for large-scale search tasks and runs efficiently on both CPUs and GPUs.
- Annoy: This is a tool created by Spotify for approximate nearest-neighbor search in high-dimensional spaces (vectors with many dimensions). It's built to handle big datasets and searches using a forest of randomly partitioned trees.
- hnswlib: This library uses the HNSW (Hierarchical Navigable Small World) algorithm. It's known for being fast and not needing too much memory, making it great for dealing with lots of high-dimensional vector data.
- nmslib (Non-Metric Space Library): It’s an open-source tool that's good at searching through non-metric spaces (spaces where distance isn't measured in the usual way). It uses different algorithms like HNSW and SW-graph for searching.
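For comparison, here is an equally small FAISS sketch using its exact (flat) L2 index; its approximate indexes follow the same pattern. The data is random and purely illustrative:

```python
import numpy as np
import faiss

dim = 64
database = np.random.random((10_000, dim)).astype("float32")
queries = np.random.random((3, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)          # brute-force L2 index; ANN variants (IVF, HNSW, ...) also exist
index.add(database)                     # store the vectors
distances, ids = index.search(queries, 4)  # 4 nearest neighbors per query
print(ids)
```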
Vector databases are key to managing and analyzing machine learning models and their embeddings. They shine in similarity or semantic search, enabling quick and efficient navigation through massive datasets of text, images, or videos to find items matching specific queries based on vector similarities. This technology finds diverse applications, including:

- Anomaly Detection: vector databases compare embeddings to identify unusual patterns, crucial in areas like fraud detection and network security.
- Personalization: they enhance recommendation systems by aligning similar vectors with user preferences.
- Natural Language Processing (NLP): these databases facilitate tasks like sentiment analysis and text classification by effectively comparing and analyzing text represented as vector embeddings.

As the technology evolves, vector databases continue to find new and innovative applications, broadening the scope of how we handle and analyze large datasets in various fields.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.