How to use Retrieval Augmented Generation (RAG) for Go applications

Generative AI development has been democratised, thanks to powerful Machine Learning models (specifically Large Language Models such as Claude, Meta's Llama 2, etc.) being exposed by managed platforms/services as API calls. This frees developers from infrastructure concerns and lets them focus on the core business problems. It also means that developers are free to use the programming language best suited for their solution. Python has typically been the go-to language for AI/ML solutions, but there is more flexibility in this area.

In this post you will see how to leverage the Go programming language to use Vector Databases and techniques such as Retrieval Augmented Generation (RAG) with langchaingo. If you are a Go developer who wants to learn how to build generative AI applications, you are in the right place!

If you are looking for introductory content on using Go for AI/ML, feel free to check out my previous blogs and open-source projects in this space.

First, let's take a step back and get some context before diving into the hands-on part of this post.

The limitations of LLMs

Large Language Models (LLMs) and other foundation models have been trained on a large corpus of data enabling them to perform well at many natural language processing (NLP) tasks. But one of the most important limitations is that most foundation models and LLMs use a static dataset which often has a specific knowledge cut-off (say, January 2022).

For example, if you were to ask about an event that took place after the cut-off date, it would either fail to answer (which is fine) or worse, confidently reply with an incorrect response - this is often referred to as Hallucination.

We need to consider the fact that LLMs only respond based on the data they were trained on - it limits their ability to accurately answer questions on topics which are either specialized, or proprietary. For instance, if I were to ask a question about a specific AWS service, the LLM may (or may not) be able to come up with an accurate response. Wouldn't it be nice if the LLM could use the official AWS service documentation as reference?

RAG (Retrieval Augmented Generation) helps alleviate these issues

It enhances LLMs by dynamically retrieving external information during the response generation process, expanding the model's knowledge beyond its original training data. RAG-based solutions incorporate a vector store which can be indexed and queried to retrieve the most recent and relevant information. When an application that uses RAG needs to generate a response, it first queries the vector store for relevant, up-to-date information related to the query and passes the results to the LLM along with the prompt. This ensures that the model's output is based not just on its pre-existing knowledge but also on the latest information, which improves the accuracy and relevance of its responses.

But, RAG is not the only way

Although this post focuses solely on RAG, there are other ways to work around this problem, each with its pros and cons:

Task-Specific tuning: Fine-tuning large language models on specific tasks or datasets to improve their performance on those domains.
Prompt Engineering: Carefully designing input prompts to guide language models towards desired outputs, without requiring significant architectural changes.
Few-Shot and Zero-Shot Learning: Techniques that enable language models to adapt to new tasks with limited or no additional training data.

Vector Store and Embeddings

I mentioned vector stores a few times in the last section. These are databases that store and index vector embeddings, which are numerical representations of data such as text, images, or entities. Embeddings help us go beyond basic search since they represent the semantic meaning of the source data - hence the term semantic search, a technique that uses the meaning and context of words to improve search accuracy and relevance. Vector databases can also store metadata, including references to the original data source of the embedding (for example, the URL of a web document).
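To make the idea of "semantically close" a bit more tangible, here is a small, framework-free sketch. It is not how any particular vector database is implemented: the `record` type, the three-dimensional vectors, and the example URLs are all made up for illustration (real embedding models return vectors with hundreds or thousands of dimensions). It ranks a few stored records against a query vector using cosine similarity and returns the closest ones along with their metadata.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// record is a hypothetical vector-store row: the embedding plus the original
// text and a piece of metadata pointing back to the source.
type record struct {
	Content   string
	Source    string    // metadata, e.g. the URL of a web document
	Embedding []float64 // toy 3-dimensional vector, for illustration only
}

// cosineSimilarity returns a value in [-1, 1]; higher means the vectors point
// in a more similar direction. It assumes both slices have the same length.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// topK returns up to k records ordered from most to least similar to query.
func topK(records []record, query []float64, k int) []record {
	ranked := make([]record, len(records))
	copy(ranked, records)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosineSimilarity(ranked[i].Embedding, query) >
			cosineSimilarity(ranked[j].Embedding, query)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

// sampleRecords returns made-up data; the vectors are hand-picked so that the
// two database-related records sit close together.
func sampleRecords() []record {
	return []record{
		{Content: "Baking sourdough bread", Source: "https://example.com/bread", Embedding: []float64{0.0, 0.1, 0.9}},
		{Content: "NoSQL data modeling", Source: "https://example.com/nosql", Embedding: []float64{0.8, 0.3, 0.1}},
		{Content: "DynamoDB table design", Source: "https://example.com/dynamodb", Embedding: []float64{0.9, 0.1, 0.0}},
	}
}

func main() {
	query := []float64{0.85, 0.15, 0.05} // pretend this is the embedded question
	for _, r := range topK(sampleRecords(), query, 2) {
		fmt.Printf("%s (%s)\n", r.Content, r.Source)
	}
}
```

A real vector database does the same kind of ranking, but with indexes that avoid comparing the query against every stored vector.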

Thanks to generative AI technologies, there has also been an explosion in Vector Databases. These include established SQL and NoSQL databases that you may already be using in other parts of your architecture - such as PostgreSQL, Redis, MongoDB and OpenSearch. But there are also databases that are custom-built for vector storage. Some of these include Pinecone, Milvus, Weaviate, etc.

Alright, let's go back to RAG...

What does a typical RAG workflow look like?

At a high level, RAG-based solutions have the following workflow. These are often executed as a cohesive pipeline:

Retrieve data from a variety of external sources like documents, images, web URLs, databases, proprietary data sources, etc. This includes sub-steps such as chunking, which involves splitting large datasets (e.g. a 100 MB PDF file) into smaller parts for indexing.
Create embeddings - This involves using an embedding model to convert data into their numerical representations.
Store/Index embeddings in a vector store
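As an illustration of the chunking sub-step, here is a minimal sketch of a fixed-size splitter with overlap. It is an assumption-laden toy, not the splitter that langchaingo ships: the function name `chunkText` and the choice to count runes (rather than tokens or sentences) are mine. The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears intact in at least one chunk.

```go
package main

import "fmt"

// chunkText splits text into chunks of at most size runes. Each chunk starts
// (size - overlap) runes after the previous one, so consecutive chunks share
// overlap runes. It returns nil for invalid arguments.
func chunkText(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	runes := []rune(text) // work on runes so multi-byte characters are not cut in half
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last chunk reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkText("abcdefghij", 4, 1) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Production splitters usually try to break on paragraph or sentence boundaries first and only fall back to a hard cut, but the size and overlap parameters play the same role.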

Ultimately, this is integrated as part of a larger application where the contextual data (semantic search result) is provided to LLMs (along with the prompts).

End-to-end RAG workflow in action

Each of the workflow steps can be executed with different components. The ones used in this blog include:

PostgreSQL - It will be used as a Vector Database, thanks to the pgvector extension. To keep things simple, we will run it in Docker.
langchaingo - It is a Go port of the langchain framework. It provides plugins for various components, including vector stores. We will use it to load data from a web URL and index it in PostgreSQL.
Text and embedding models - We will use Amazon Bedrock Claude and Titan models (for text and embedding respectively) with langchaingo.
Retrieval and app integration - langchaingo vector store (for semantic search) and chain (for RAG).

You will get a sense of how these individual pieces work. We will cover other variants of this architecture in subsequent blogs.

Before you begin

Make sure you have:

Go, Docker and psql installed (e.g., using Homebrew if you're on a Mac).
Amazon Bedrock access configured from your local machine - Refer to this blog post for details.

Start PostgreSQL on Docker

There is a Docker image we can use!

Activate pgvector extension by logging into PostgreSQL (using psql) from a different terminal:

Load data into PostgreSQL (Vector Store)

Clone the project repository:

At this point, I am assuming that your local machine is configured to work with Amazon Bedrock.

The first thing we will do is load data into PostgreSQL. In this case, we will use an existing web page as the source of information.

I have used https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html - but feel free to use your own! Make sure to change the search query accordingly in the subsequent steps.

You should get the following output:

Give it a few seconds. Finally, you should see this output if all goes well:

To verify, go back to the psql terminal and check the tables:

You should see a couple of tables - langchain_pg_collection and langchain_pg_embedding. These are created by langchaingo since we did not specify them explicitly (that's ok, it's convenient for getting started!). langchain_pg_collection contains the collection name while langchain_pg_embedding stores the actual embeddings.

You can introspect the tables:

You will see 23 rows in the langchain_pg_embedding table, since that was the number of langchain documents that our web page source was split into (refer to the application logs above when you loaded the data).

A quick detour into how this works...

The data loading implementation is in load.go, but let's look at how we access the vector store instance (in common.go):

pgvector.WithConnectionURL is where the connection information for the PostgreSQL instance is provided.
pgvector.WithEmbedder is the interesting part, since this is where we can plug in the embedding model of our choice. langchaingo supports Amazon Bedrock embeddings. In this case I have used the Amazon Titan embedding model.
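The `With...` functions follow a common Go idiom known as functional options. If you have not come across it before, the sketch below shows the general shape of the pattern. To be clear, this is not langchaingo's code: `Store`, `Embedder`, `fakeEmbedder`, and `New` are hypothetical stand-ins written for this example, and the connection URL is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// Embedder is a hypothetical stand-in for an embedding model client.
type Embedder interface {
	Name() string
}

// fakeEmbedder exists only so the example can run without a real model.
type fakeEmbedder struct{ name string }

func (f fakeEmbedder) Name() string { return f.name }

// Store is a hypothetical vector store handle (not langchaingo's type).
type Store struct {
	connURL  string
	embedder Embedder
}

// Option configures a Store while it is being constructed.
type Option func(*Store)

// WithConnectionURL sets the database connection string.
func WithConnectionURL(url string) Option {
	return func(s *Store) { s.connURL = url }
}

// WithEmbedder plugs in the embedding model to use.
func WithEmbedder(e Embedder) Option {
	return func(s *Store) { s.embedder = e }
}

// New applies the options in order and then validates the result.
func New(opts ...Option) (*Store, error) {
	s := &Store{}
	for _, opt := range opts {
		opt(s)
	}
	if s.connURL == "" {
		return nil, errors.New("connection URL is required")
	}
	if s.embedder == nil {
		return nil, errors.New("embedder is required")
	}
	return s, nil
}

func main() {
	store, err := New(
		WithConnectionURL("postgres://user:password@localhost:5432/example"), // placeholder
		WithEmbedder(fakeEmbedder{name: "toy-model"}),
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("store configured with embedder:", store.embedder.Name())
}
```

The appeal of this pattern is that swapping the embedding model is a one-line change at the call site, and new options can be added later without breaking existing callers.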

Back to the loading process in load.go. We first get the data in the form of a slice of schema.Document (getDocs function), using the langchaingo built-in HTML loader.

Then, we load it into PostgreSQL. Instead of writing everything by ourselves, we can use the langchaingo vector store abstraction and use the high level function AddDocuments:

Great. We have set up a simple pipeline to fetch and ingest data into PostgreSQL. Let's make use of it!
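Before moving on, it may help to see roughly what an AddDocuments-style call has to do on your behalf: turn each document's text into a vector and store the vector next to the text and its metadata. The sketch below is a conceptual, in-memory stand-in and makes no claim about langchaingo's internals. `Document`, `Embedder`, `toyEmbedder`, `memoryStore`, and `demoStore` are all names invented for this example, and the "embedding" is a deliberately silly two-number vector (text length and word count) so that it runs without a model.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Document is a hypothetical chunk of source text plus metadata.
type Document struct {
	PageContent string
	Metadata    map[string]string
}

// Embedder is a hypothetical interface for an embedding model.
type Embedder interface {
	EmbedDocuments(ctx context.Context, texts []string) ([][]float32, error)
}

// toyEmbedder fakes an embedding: [length in bytes, number of words].
type toyEmbedder struct{}

func (toyEmbedder) EmbedDocuments(_ context.Context, texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i, t := range texts {
		out[i] = []float32{float32(len(t)), float32(len(strings.Fields(t)))}
	}
	return out, nil
}

// row is what ends up stored: the document and its vector side by side.
type row struct {
	Doc       Document
	Embedding []float32
}

// memoryStore keeps rows in a slice instead of a database table.
type memoryStore struct {
	embedder Embedder
	rows     []row
}

// AddDocuments embeds all documents in one batch and stores the results.
func (s *memoryStore) AddDocuments(ctx context.Context, docs []Document) error {
	texts := make([]string, len(docs))
	for i, d := range docs {
		texts[i] = d.PageContent
	}
	vectors, err := s.embedder.EmbedDocuments(ctx, texts)
	if err != nil {
		return fmt.Errorf("embed documents: %w", err)
	}
	if len(vectors) != len(docs) {
		return errors.New("embedder returned a different number of vectors than documents")
	}
	for i, d := range docs {
		s.rows = append(s.rows, row{Doc: d, Embedding: vectors[i]})
	}
	return nil
}

// demoStore loads two made-up documents and returns the populated store.
func demoStore() (*memoryStore, error) {
	store := &memoryStore{embedder: toyEmbedder{}}
	docs := []Document{
		{PageContent: "first chunk of text", Metadata: map[string]string{"source": "https://example.com/a"}},
		{PageContent: "second chunk", Metadata: map[string]string{"source": "https://example.com/b"}},
	}
	return store, store.AddDocuments(context.Background(), docs)
}

func main() {
	store, err := demoStore()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rows stored:", len(store.rows))
}
```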

Execute Semantic Search

Let's ask a question. I am going with "what tools can I use to design dynamodb data models?", which is relevant to the document I used as the data source - feel free to change it to suit your scenario.

You should see a similar output - note that we opted to output a maximum of three results (you can change it):

What you see here are the top three results (thanks to -maxResults=3).

Note that this is not an answer to our question. These are the results from our vector store that are semantically close to the query - the key word here is semantically.

Thanks to the vector store abstraction in langchaingo, we were able to easily ingest our source data into PostgreSQL and use the SimilaritySearch function to get the top N results corresponding to our query (see semanticSearch function in query.go):

Note that (at the time of writing) the pgvector implementation in langchaingo uses the cosine distance vector operation, but pgvector also supports L2 distance and inner product - for details, refer to the pgvector documentation.
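If the three metrics are unfamiliar, the sketch below computes each of them for plain Go slices. It is only meant to show the arithmetic; the function names are my own, and the database does this work for you. For reference, pgvector exposes these as the operators `<=>` (cosine distance), `<->` (L2 distance), and `<#>` (negative inner product).

```go
package main

import (
	"fmt"
	"math"
)

// innerProduct is the sum of element-wise products. Larger means more similar,
// which is why pgvector's <#> operator returns its negative as a "distance".
func innerProduct(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// l2Distance is the straight-line (Euclidean) distance between two vectors.
func l2Distance(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosineDistance is 1 minus the cosine similarity: 0 for vectors pointing in
// the same direction, 1 for orthogonal vectors, 2 for opposite vectors.
// Cosine is undefined for zero vectors; this sketch returns 1 in that case.
func cosineDistance(a, b []float64) float64 {
	normA := math.Sqrt(innerProduct(a, a))
	normB := math.Sqrt(innerProduct(b, b))
	if normA == 0 || normB == 0 {
		return 1
	}
	return 1 - innerProduct(a, b)/(normA*normB)
}

func main() {
	a := []float64{1, 0}
	b := []float64{0, 1}
	fmt.Println("inner product:  ", innerProduct(a, b))
	fmt.Println("L2 distance:    ", l2Distance(a, b))
	fmt.Println("cosine distance:", cosineDistance(a, b))
}
```

All three functions assume the two vectors have the same length. Cosine distance ignores vector magnitude and only compares direction, which is one reason it is a popular default for text embeddings.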

Ok, so far we have:

Loaded vector data
Executed semantic search

This is the stepping stone to RAG (Retrieval Augmented Generation) - let's see it in action!

Intelligent search with RAG

To execute a RAG-based search, we run the same command as above (almost), only with a slight change in the action (rag_search):

Here is the output I got (might be slightly different in your case):

As you can see, the result is not just "here are the top X responses for your query". Instead, it's a well-formulated response to the question. Let's peek behind the scenes again to see how it works.

Unlike ingestion and semantic search, RAG-based search is not directly exposed by the langchaingo vector store implementation. For this, we use a langchaingo chain which takes care of the following:

Invokes semantic search
Combines the semantic search along with a prompt
Sends it to a Large Language Model (LLM), which in this case happens to be Claude on Amazon Bedrock.
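To make these three steps concrete, here is a framework-free sketch of the same flow. It is an illustration under assumptions, not the langchaingo chain: the `Retriever` and `LLM` interfaces, the prompt wording, and the stub implementations are all hypothetical, and the "LLM" simply echoes the prompt back so you can see exactly what a real model would receive.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Retriever is a hypothetical interface over semantic search.
type Retriever interface {
	Relevant(ctx context.Context, query string, maxResults int) ([]string, error)
}

// LLM is a hypothetical interface over a text generation model.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// buildPrompt combines the retrieved context with the question (step 2).
func buildPrompt(question string, contexts []string) string {
	var b strings.Builder
	b.WriteString("Use the following context to answer the question. ")
	b.WriteString("If the context does not contain the answer, say you don't know.\n\n")
	for i, c := range contexts {
		fmt.Fprintf(&b, "Context %d:\n%s\n\n", i+1, c)
	}
	fmt.Fprintf(&b, "Question: %s\nAnswer:", question)
	return b.String()
}

// answerWithContext runs the three steps: retrieve, build the prompt, generate.
func answerWithContext(ctx context.Context, r Retriever, llm LLM, question string, maxResults int) (string, error) {
	docs, err := r.Relevant(ctx, question, maxResults) // step 1: semantic search
	if err != nil {
		return "", fmt.Errorf("retrieve: %w", err)
	}
	return llm.Generate(ctx, buildPrompt(question, docs)) // steps 2 and 3
}

// staticRetriever returns the first maxResults entries of a fixed list.
type staticRetriever struct{ docs []string }

func (s staticRetriever) Relevant(_ context.Context, _ string, maxResults int) ([]string, error) {
	if maxResults < 0 {
		maxResults = 0
	}
	if maxResults > len(s.docs) {
		maxResults = len(s.docs)
	}
	return s.docs[:maxResults], nil
}

// echoLLM returns the prompt unchanged instead of calling a model.
type echoLLM struct{}

func (echoLLM) Generate(_ context.Context, prompt string) (string, error) { return prompt, nil }

// Made-up demo data.
const demoQuestion = "what tools can I use to design dynamodb data models?"

var demoDocs = []string{"first retrieved passage", "second retrieved passage", "third retrieved passage"}

// demo wires the stubs together and returns what the "LLM" produced.
func demo() (string, error) {
	return answerWithContext(context.Background(), staticRetriever{docs: demoDocs}, echoLLM{}, demoQuestion, 2)
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Because retrieval and generation sit behind small interfaces, either side can be swapped (a different vector store, a different model) without touching the orchestration function, which is essentially the convenience a chain gives you.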

Here is what the chain looks like (refer to function ragSearch in query.go):

Let's try another one

This was just one example. I tried a different question and increased maxResults to 10, which means that the top 10 results from the vector database will be used to formulate the answer.

The result (again, it might be different for you):

Where to "Go" from here?

Learning by doing is a good approach. If you've followed along and executed the application thus far, great!

I recommend you try out the following:

langchaingo supports many different models, including ones in Amazon Bedrock (e.g. Meta Llama 2, Cohere, etc.) - try switching the model and see if it makes a difference. Is the output better?
What about the Vector Database? I demonstrated PostgreSQL, but langchaingo supports others as well (including OpenSearch, Chroma, etc.) - try swapping out the vector store and see how (or if) the search results differ.
You probably get the gist, but you can also try out different embedding models. We used Amazon Titan, but langchaingo also supports many others, including Cohere embed models in Amazon Bedrock.

Wrap up

This was a simple example for you to better understand the individual steps in building RAG-based solutions. These might change a bit depending on the implementation, but the high-level ideas remain the same.

I used langchaingo as the framework, but this doesn't mean you always have to use one. You could also remove the abstractions and call the LLM platform APIs directly if you need granular control in your applications or the framework does not meet your requirements. Like most of generative AI, this area is rapidly evolving, and I am optimistic about Go developers having more options to build generative AI solutions.

If you have feedback, questions, or you would like me to cover something else around this topic, feel free to comment below!

Happy building!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
