# Vector Embeddings and RAG Demystified: Leveraging Amazon Bedrock, Aurora, and LangChain - Part 1

## Revolutionize big data handling and machine learning applications.

In this post, we will cover:

- The Versatility of Embeddings in Applications
- Creating Embeddings Using Boto3
- Creating Embeddings Using LangChain
- E-commerce Product Recommendation
- Delving Deeper Into Vector Storage
- Enhanced Locality-Sensitive Hashing (LSH)
- Hierarchical Navigable Small World (HNSW)
- Advanced Graph-Based Similarity Search Techniques

Throughout, we will use `LangChain`, a tool that enhances our journey into the practical application of these concepts, demonstrating how to seamlessly integrate these technologies into real-world AI solutions.

So, what exactly is an **embedding**? An embedding is a numerical representation of content in a form that machines can process and understand. The essence of the process is to convert an object, such as an image or text, into a vector that encapsulates its semantic content while discarding irrelevant details as much as possible. An embedding takes a piece of content, like a word, sentence, or image, and maps it into a multi-dimensional vector space. The distance between two embeddings indicates the semantic similarity between the corresponding concepts.

Consider the terms 'coffee' and 'tea'. In a hypothetical vocabulary space, these two could be transformed into numerical vectors. If we visualize this in a 3-dimensional vector space, 'coffee' might be represented as `[1.2, -0.9, 0.3]` and 'tea' as `[1.0, -0.8, 0.5]`. Such numerical vectors carry semantic information, indicating that 'coffee' and 'tea' are conceptually similar to each other due to their association with hot beverages, and they would likely be positioned closer together in the vector space than either would be to unrelated concepts like 'astronomy' or 'philosophy'.
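To see how such vectors encode similarity, here is a minimal NumPy sketch using the hypothetical vectors above (the 'astronomy' vector is a made-up stand-in for an unrelated concept):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings from the example above
coffee = np.array([1.2, -0.9, 0.3])
tea = np.array([1.0, -0.8, 0.5])
astronomy = np.array([-0.8, 0.7, -1.1])  # assumed vector for an unrelated concept

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(coffee, tea))        # close to 1: semantically similar
print(cosine_similarity(coffee, astronomy))  # negative: semantically distant
```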

An early approach to turning text into vectors was the **bag-of-words model**. Here, words within a text are represented by their frequency of occurrence. Scikit-learn, a powerful Python library, encapsulates this method within its `CountVectorizer` tool. For a while, this was the standard, until the introduction of **word2vec**.
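A minimal sketch of the bag-of-words approach with `CountVectorizer` (the toy corpus is purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I like coffee",
    "I like tea",
    "coffee and tea are hot beverages",
]

vectorizer = CountVectorizer()
# Each document becomes a vector of raw word counts over the shared vocabulary
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one count vector per document
```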

**Word2vec** represented a paradigm shift. It diverged from simply tallying words to understanding context by predicting a word's presence from its neighboring words while ignoring the sequence in which they appear. Under the hood, it is a shallow, linear model rather than a deep network.
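For illustration, here is a minimal word2vec sketch using the `gensim` library; the tiny corpus and hyperparameters are purely illustrative, and real models are trained on far larger corpora:

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens
sentences = [
    ["coffee", "is", "a", "hot", "beverage"],
    ["tea", "is", "a", "hot", "beverage"],
    ["astronomy", "studies", "stars", "and", "planets"],
]

# Train a small word2vec model; vector_size sets the embedding dimensionality
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)

print(model.wv["coffee"])                    # the learned 16-dimensional vector
print(model.wv.similarity("coffee", "tea"))  # cosine similarity between words
```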

For images, building embeddings traditionally involved extracting hand-crafted features: **identifying edges, analyzing textures**, and looking at **color patterns**. We do this over different sizes of image areas, making sure these embeddings capture changes in scale and position.

The advent of **Convolutional Neural Networks (CNNs)** has significantly changed our approach to image analysis. CNNs, especially when pre-trained on large datasets like ImageNet, use what are known as `filters` or `kernels`. A CNN operates by examining small portions of an image and recognizing various features like lines, colors, and shapes. It then progressively combines these features to understand more complex structures within the image, such as objects or faces. When applied to a new image, CNNs are capable of generating detailed and insightful vector representations. These representations are not just pixel-level data but a more profound understanding of the image's content, making CNNs invaluable in areas like facial recognition, medical imaging, and autonomous vehicles. Their ability to learn from vast amounts of data and identify intricate patterns makes them a cornerstone of modern image processing techniques.

For both `textual` and `visual` data, the trend has shifted towards **transformer-based models**. These models use attention mechanisms to weigh the relevance of different parts of the input, whether `words in text` or `pixels in images`. Equipped with a large number of parameters, these models excel in identifying complex patterns and relationships through training on comprehensive datasets.

Once generated, an embedding can serve many downstream tasks; the same text embedding might power semantic search or help classify '`positive`' or '`negative`' sentiment indicators. This flexibility makes embeddings a fundamental component in many data-driven applications.

There are different distance metrics used in vector similarity calculations, such as:

- **Euclidean distance**: Measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, where 0 represents identical vectors and larger values represent increasingly dissimilar vectors.
- **Cosine similarity**: Calculates the cosine of the angle between two vectors. It ranges from -1 to 1, where 1 represents identical directions, 0 represents orthogonal vectors, and -1 represents vectors that are diametrically opposed.
- **Dot product**: Reflects the product of the magnitudes of two vectors and the cosine of the angle between them. Its range extends from -∞ to ∞, with a positive value indicating vectors that point in the same direction, 0 indicating orthogonal vectors, and a negative value indicating vectors that point in opposite directions.
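Here is a minimal NumPy sketch computing all three metrics on the hypothetical 'coffee' and 'tea' vectors from earlier:

```python
import numpy as np

a = np.array([1.2, -0.9, 0.3])  # 'coffee'
b = np.array([1.0, -0.8, 0.5])  # 'tea'

# Euclidean distance: straight-line distance; 0 means identical vectors
euclidean = np.linalg.norm(a - b)

# Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: positive = same direction, 0 = orthogonal, negative = opposite
dot = np.dot(a, b)

print(f"Euclidean: {euclidean:.4f}, Cosine: {cosine:.4f}, Dot: {dot:.4f}")
```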

Later in this series, we will store these embeddings in Amazon Aurora PostgreSQL, which supports `pgvector` for similarity searches.

## Creating Embeddings Using Boto3

First, let's create embeddings using the `bedrock` client; later, we will see how we can do the same using `LangChain`.


```python
import boto3
import json

# Start a session with AWS using boto3
session = boto3.Session(profile_name='default')

# Initialize the bedrock-runtime client to interact with AWS AI services
bedrock = session.client(service_name='bedrock-runtime')

# Define a function to get the embedding for a given text
def get_embedding(text):
    # Convert the input text to a JSON-formatted string
    body = json.dumps({"inputText": text})
    # Define the model identifier for the embedding service
    model_id = 'amazon.titan-embed-text-v1'
    # Specify the type for the request and the expected response
    accept_type = 'application/json'
    # Call the embedding model service, passing the prepared request body and headers
    response = bedrock.invoke_model(body=body,
                                    modelId=model_id,
                                    accept=accept_type,
                                    contentType=accept_type)
    # Read the response body, expected to be in JSON format
    response_body = json.loads(response['body'].read())
    # Extract the embedding from the response
    embedding = response_body['embedding']
    # Return the extracted embedding
    return embedding

# Build a list of embeddings to compare
embeddings_using_boto = []

# Append the embedding of sentences to the list
embeddings_using_boto.append(get_embedding("Sunny skies today."))
embeddings_using_boto.append(get_embedding("Language learning is fun."))
embeddings_using_boto.append(get_embedding("Cats are independent."))
embeddings_using_boto.append(get_embedding("Stocks go up and down."))
embeddings_using_boto.append(get_embedding("Home cooking is healthy."))

# Print the total number of embeddings in the list; expected output is 5
print("Total embeddings:", len(embeddings_using_boto))

# Print the length of each embedding vector; each should be the same, e.g., 1536
print("Embedding lengths:", [len(vec) for vec in embeddings_using_boto])
```


```
# Expected output
Total embeddings: 5
Embedding lengths: [1536, 1536, 1536, 1536, 1536]
```

The preceding code:

- Initializes a session with AWS using `Boto3` and creates a client for the `bedrock-runtime` service.
- Defines a function `get_embedding`, which accepts a text input and uses the **Amazon Titan Embeddings** model to transform this text into an `embedding`. Once the embedding is generated, the function returns the embedding vector.

## Creating Embeddings Using LangChain

Now let's generate the same embeddings with LangChain, using the `embed_query()` method from the `BedrockEmbeddings` class.

class.`1`

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

```python
from langchain.embeddings import BedrockEmbeddings

# Initialize an instance of BedrockEmbeddings
embeddings = BedrockEmbeddings()

# Initialize an empty list to store the embeddings
embeddings_using_lc = []

# Append the embedding of sentences to the list
embeddings_using_lc.append(embeddings.embed_query("Sunny skies today."))
embeddings_using_lc.append(embeddings.embed_query("Language learning is fun."))
embeddings_using_lc.append(embeddings.embed_query("Cats are independent."))
embeddings_using_lc.append(embeddings.embed_query("Stocks go up and down."))
embeddings_using_lc.append(embeddings.embed_query("Home cooking is healthy."))

# Print the total number of embeddings in the list; expected output is 5
print("Total embeddings:", len(embeddings_using_lc))

# Print the length of each embedding vector; each should be the same, e.g., 1536
print("Embedding lengths:", [len(vec) for vec in embeddings_using_lc])
```


```
# Expected output
Total embeddings: 5
Embedding lengths: [1536, 1536, 1536, 1536, 1536]
```

This code:

- Imports the `BedrockEmbeddings` class from `langchain`.
- Creates an instance of `BedrockEmbeddings` to generate embeddings.
- Appends embeddings of several sentences to a `list`.

Instead of embedding one string at a time, you can pass a list of texts to the `embed_documents()` method as well.


```python
from langchain.embeddings import BedrockEmbeddings

# Initialize an instance of BedrockEmbeddings
embeddings = BedrockEmbeddings()

# Obtain embeddings for a list of sentences in one call
embeddings_using_lc_2 = embeddings.embed_documents([
    "Sunny skies today.",
    "Language learning is fun.",
    "Cats are independent.",
    "Stocks go up and down.",
    "Home cooking is healthy."
])
```

Notice that the embeddings are identical whether we generate them with `LangChain` or `boto3`. This uniformity is attributed to the underlying model in use, which is `Amazon Titan Embeddings G1 - Text`.


```python
# Compare the first embedding from each method to verify they are the same
are_embeddings_equal = (embeddings_using_boto[0] == embeddings_using_lc[0] == embeddings_using_lc_2[0])
print("Are the first embeddings from each method equal?", are_embeddings_equal)
```


```
# Expected output
Are the first embeddings from each method equal? True
```


```python
# Print the text and its corresponding embedding
print("Text:", "Sunny skies today.")
print("Embedding:", embeddings_using_boto[0])
```


```
# Expected output
Text: Sunny skies today.
Embedding: [1.21875, 0.122558594, ..., -0.021362305] # An array of length 1536
```

I have found the documentation of **LangChain** particularly helpful, and I have incorporated several of its narratives into this blog post.

## Delving Deeper Into Vector Storage

Knowing how to create a `vector` is crucial, but equally important is knowing how to store these vectors. Before diving into storage methods, let's briefly touch on **Vector Search**, which underscores the need for storing embeddings.

**Vector search** involves representing each data point as a `vector` in a high-dimensional space, capturing the data's features or characteristics. The aim is to identify vectors most similar to a given query vector. We've seen how these vector embeddings, numerical arrays representing coordinates in a high-dimensional space, are crucial in measuring distances using metrics like `cosine similarity` or `euclidean distance`, which we discussed earlier.

Imagine an e-commerce platform where each product has a vector representing its features, like color, size, category, and user ratings. When a user searches for a product, the search query is converted into a vector. The system then performs a vector search to find products with similar feature vectors, suggesting these as recommendations.

This process requires efficient vector storage. A vector storage mechanism is essential for storing and retrieving vector embeddings. While standalone solutions exist for this, vector databases like Amazon Aurora (with `pgvector`), Amazon OpenSearch, and Amazon Kendra offer more integrated functionalities. They not only store but also manage large sets of vectors, using indexing mechanisms for efficient similarity searches. We will dive into vector stores and databases in the next section.
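A toy sketch of that recommendation flow, with hand-made product vectors standing in for real embeddings (in practice, these would come from an embedding model):

```python
import numpy as np

# Hypothetical feature vectors for products
products = {
    "red summer dress":   np.array([0.90, 0.10, 0.30]),
    "crimson maxi dress": np.array([0.85, 0.15, 0.35]),
    "winter ski jacket":  np.array([-0.20, 0.90, -0.40]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The user's search query, already converted into a vector
query = np.array([0.88, 0.12, 0.32])

# Rank products by similarity to the query and recommend the closest first
ranked = sorted(products.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
for name, _ in ranked:
    print(name)
```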

Three building blocks come up repeatedly in this space:

- **Indexing**: This is about organizing vectors to speed up retrieval. Techniques like k-d trees or Annoy are employed for this.
- **Vector libraries**: These offer functions for operations like dot product and vector indexing.
- **Vector databases**: They are specifically designed for storing, managing, and retrieving vast sets of vectors. Examples include Amazon Aurora (with `pgvector`), Amazon OpenSearch, and Amazon Kendra, which utilize indexing for efficient searches.

**Indexing** in the context of vector embeddings is a method of organizing data to optimize its retrieval. It's akin to indexing in traditional database systems, where it allows quicker access to records. For vector embeddings, indexing aims to structure the vectors so that similar vectors are stored adjacently, enabling fast proximity or similarity searches. Structures like K-dimensional trees (k-d trees) and Ball Trees are commonly applied, and libraries such as Annoy and FAISS implement approximate techniques suited to high-dimensional vectors.
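As a small illustration, here is a k-d tree index built with scikit-learn; the random data stands in for real embeddings:

```python
import numpy as np
from sklearn.neighbors import KDTree

# 1,000 random 8-dimensional vectors standing in for embeddings
rng = np.random.default_rng(42)
vectors = rng.random((1000, 8))

# Build the index once; queries then avoid scanning every vector
tree = KDTree(vectors)

# Find the 3 nearest neighbors of the first vector
distances, indices = tree.query(vectors[:1], k=3)
print(indices)    # the vector itself appears first: it is its own nearest neighbor
print(distances)
```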

K-Nearest Neighbors (KNN) is a straightforward algorithm used for classification and regression tasks. In KNN, the class or value of a data point is determined by its k nearest neighbors in the training dataset. The algorithm proceeds as follows (a code sketch follows the list):

1. **Selecting k**: Decide the number of nearest neighbors (k) to influence the classification or regression.
2. **Distance Calculation**: Measure the distance between the point to classify and every point in the training dataset.
3. **Identifying Nearest Neighbors**: Choose the k closest data points.
4. **Classifying or Regressing**:
   - *For classification*: Assign the class based on the most frequent class within the k neighbors.
   - *For regression*: Use the average value from the k neighbors as the prediction.
5. **Making Predictions**: The algorithm assigns the predicted class or value to the new data point.
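Here is the promised sketch: a minimal brute-force KNN classifier, with made-up training data:

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    # Step 2: distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority vote among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])
y_train = ["beverage", "beverage", "astronomy", "astronomy"]

print(knn_classify(np.array([1.05, 0.95]), X_train, y_train, k=3))  # "beverage"
```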

A brute-force KNN search has a complexity of `O(nd)`, where `n` is the number of vectors and `d` is the vector dimension. This scalability issue is addressed with Approximate Nearest Neighbor (ANN) algorithms for faster search, including:

- **Product Quantization (PQ)**
- **Locality-Sensitive Hashing (LSH)**
- **Hierarchical Navigable Small World (HNSW)**

**Product Quantization (PQ)** compresses vectors so that searches touch far less data. It works as follows (a minimal sketch appears after the list):

1. **Vector Breakdown**: The first step in PQ involves breaking down each high-dimensional vector into smaller `sub-vectors`. By dividing the vector into segments, PQ can manage each piece individually, simplifying the subsequent clustering process.
2. **Cluster Formation via K-means**: Each sub-vector is then processed through a `k-means` clustering algorithm. This is like finding representative landmarks for different neighborhoods within a city, where each landmark stands for a group of nearby locations. Multiple clusters form from the sub-vectors, each with its `centroid`. These `centroids` are the key players in PQ; instead of indexing every individual vector, PQ only stores the centroids, significantly reducing memory requirements.
3. **Centroid Indexing**: In PQ, we don't store the full detail of every vector; instead, we index the centroids of the clusters they belong to. By doing this, we achieve data compression. For example, if we use two clusters per partition and have six vectors, we achieve a 3X compression rate. This compression becomes more significant with larger datasets.
4. **Nearest Neighbor Search**: When a query vector comes in, PQ doesn't compare it against all vectors in the database. Instead, it only needs to measure the squared Euclidean distance from the centroids of each cluster. It's a quicker process because we're only comparing the query vector to a handful of centroids rather than the entire dataset.
5. **Balance Between Accuracy and Efficiency**: The trade-off here is between the granularity of the clustering (how many clusters are used) and the speed of retrieval. More clusters mean finer granularity and potentially more accurate results, but require more time to search through.
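Here is the promised sketch: a toy product quantizer built with scikit-learn's `KMeans`; the sizes and cluster counts are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.random((100, 8))        # 100 vectors of dimension 8
n_subvectors, n_clusters = 2, 4       # split each vector into 2 sub-vectors
sub_dim = vectors.shape[1] // n_subvectors

codebooks, codes = [], []
for i in range(n_subvectors):
    # Step 1: slice out the i-th sub-vector from every vector
    sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
    # Step 2: k-means finds representative centroids for this segment
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)   # the centroids are all we keep
    codes.append(km.labels_)                # each sub-vector -> a centroid id

# Step 3: each vector is now stored as just 2 small integers (its centroid ids)
codes = np.stack(codes, axis=1)
print(codes[:5])   # compressed representation of the first 5 vectors
```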

## Enhanced Locality-Sensitive Hashing (LSH)

**Locality-Sensitive Hashing (LSH)** groups similar vectors into the same hash buckets. A common random-projection variant proceeds in three steps (a minimal sketch follows the list):

1. **Dimensionality Reduction**: Initially, vectors are projected onto a lower-dimensional space using a random matrix. This step simplifies the data, making it more manageable and reducing the computational load for subsequent operations.
2. **Binary Hashing**: After dimensionality reduction, each component of the projected vector is 'binarized', typically by assigning a `1` if the component is `positive` and a `0` if `negative`. This binary hash code represents the original vector in a much simpler form.
3. **Bucket Assignment**: Vectors that share the same binary hash code are assigned to the same bucket. By doing so, LSH groups vectors that are likely to be similar into the same '`bin`', allowing for quicker retrieval based on hash codes.
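And here is the promised sketch of random-projection LSH; the dimensions and bit counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 8, 4
vectors = rng.normal(size=(100, dim))

# Step 1: a random projection matrix maps vectors to a lower-dimensional space
projection = rng.normal(size=(dim, n_bits))

def lsh_hash(v):
    # Step 2: binarize each projected component (1 if positive, else 0)
    bits = (v @ projection) > 0
    return "".join("1" if b else "0" for b in bits)

# Step 3: vectors with the same hash code land in the same bucket
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_hash(v), []).append(i)

query = vectors[0]
print("Candidates sharing the query's bucket:", buckets[lsh_hash(query)])
```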

## Hierarchical Navigable Small World (HNSW)

HNSW organizes vectors into a graph whose edges act like bridges between nearby points, and this is where the `hierarchical` part kicks in. We create multiple layers of graphs, each with different bridge lengths. The top layer has the longest bridges, while the bottom layer has the shortest: a search starts on the top layer, takes long hops toward the target region, and then descends to the finer layers for precise results. A minimal code sketch follows below.

## Advanced Graph-Based Similarity Search Techniques

In addition to HNSW and KNN, there are other ways to find similar items or patterns using graphs, such as with Graph Neural Networks (GNN) and Graph Convolutional Networks (GCN). These methods use the connections and relationships in graphs to search for similarities. There's also the Annoy (Approximate Nearest Neighbors Oh Yeah) algorithm, which sorts vectors using a tree structure made of random divisions, kind of like sorting books on shelves based on different categories. Annoy is user-friendly and good for quickly finding items that are almost, but not exactly, the same.

When choosing one of these methods, it's important to consider how fast you need the search to be, how precise the results should be, and how much computer memory you can use. The right choice depends on what the specific task needs and the type of data you're working with.
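To make HNSW concrete, here is a minimal sketch using the `hnswlib` library (one of the libraries listed below); the dimensions, data, and parameter values are illustrative only:

```python
import hnswlib
import numpy as np

dim, num_elements = 16, 1000
data = np.float32(np.random.rand(num_elements, dim))

# Build an HNSW index; M controls graph connectivity,
# ef_construction controls build-time accuracy
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef trades query speed for recall at search time
index.set_ef(50)

# Approximate 5-nearest-neighbor search for the first vector
labels, distances = index.knn_query(data[:1], k=5)
print(labels)
```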

Several open-source libraries implement these algorithms:

- **FAISS (Facebook AI Similarity Search)**: Developed by Meta (formerly Facebook), this library helps find and group together similar dense vectors, which are just vectors with a lot of numbers. It's great for big search tasks and works well with both normal computers and those with powerful GPUs.
- **Annoy**: This is a tool created by Spotify for searching near-identical vectors in high-dimensional spaces (which means lots of data points). It's built to handle big data and uses a bunch of random tree-like structures for searching.
- **hnswlib**: This library uses the HNSW (Hierarchical Navigable Small World) algorithm. It's known for being fast and not needing too much memory, making it great for dealing with lots of high-dimensional vector data.
- **nmslib (Non-Metric Space Library)**: It's an open-source tool that's good at searching through non-metric spaces (spaces where distance isn't measured in the usual way). It uses different algorithms like HNSW and SW-graph for searching.
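For comparison with the HNSW sketch above, here is a minimal FAISS sketch using its simplest index, `IndexFlatL2`, which performs an exact brute-force search (FAISS also provides IVF- and PQ-based indexes for approximate search); the data is random and purely illustrative:

```python
import faiss
import numpy as np

dim = 16
xb = np.float32(np.random.rand(1000, dim))  # database vectors
xq = xb[:1]                                 # query: the first vector

# IndexFlatL2 performs an exact (brute-force) L2 search
index = faiss.IndexFlatL2(dim)
index.add(xb)

distances, indices = index.search(xq, 5)  # 5 nearest neighbors
print(indices)  # the query's own index should appear first
```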

Vector databases are key in managing and analyzing machine learning models and their embeddings. They shine in similarity or semantic search, enabling quick and efficient navigation through massive datasets of text, images, or videos to find items matching specific queries based on vector similarities. This technology finds diverse applications, including:

- **Anomaly Detection**: Vector databases compare embeddings to identify unusual patterns, crucial in areas like fraud detection and network security.
- **Personalization**: They enhance recommendation systems by aligning similar vectors with user preferences.
- **Natural Language Processing (NLP)**: These databases facilitate tasks like sentiment analysis and text classification by effectively comparing and analyzing text represented as vector embeddings.

As the technology evolves, vector databases continue to find new and innovative applications, broadening the scope of how we handle and analyze large datasets in various fields.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.