GenAI and the Data Lake

As large language models (LLMs) start appearing in more and more applications, I've heard many questions about what role an LLM plays in a data lake. In this article, I'll look at how an LLM can act as a data consumer, a data producer, and a data agent.

An LLM as a data consumer

Machine learning (ML) solutions, like other data products, can be viewed as one of the consumers of the data in a data lake. In other words, you may use data from the data lake to fine-tune a model or provide additional context using the retrieval augmented generation (RAG) pattern. Here’s a diagram showing an ingest pipeline for using PDF documents for RAG.

If your text data is stored in another format, like in an OpenSearch index or in a set of Parquet tables, you can simply adjust the ingest part of this solution.

This pattern is very familiar, as traditional ML models also consume data for model training. There are a couple of differences, though. First, most companies are using their data to fine-tune an existing model, not to pre-train one from scratch. Second, the ingest pipeline for RAG looks different from traditional ETL pipelines. The source data is often documents like PDF files or information extracted from another system via an API. The bottleneck in these pipelines is usually the embedding model rather than the batch processing system.
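To make the ingest step concrete, here is a minimal sketch of the chunk-and-embed part of a RAG pipeline. It assumes the text has already been extracted from the source documents. The `toy_embed` function is a hypothetical stand-in for a real embedding model (a hashed bag of words), and the chunk size, overlap, and `Chunk` fields are illustrative choices, not values from the pipeline in the diagram.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    position: int
    text: str
    embedding: list


def split_into_chunks(text, size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text, dims=16):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(doc_id, text, embed=toy_embed):
    """Chunk one document and attach an embedding to each chunk."""
    return [
        Chunk(doc_id, position, chunk, embed(chunk))
        for position, chunk in enumerate(split_into_chunks(text))
    ]


records = ingest("manual-1", "reset the router " * 40)
```

In a real pipeline the `embed` argument would call your embedding model, which is where the bottleneck mentioned above tends to show up, so batching those calls is usually worth the effort.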

You may ask when it makes sense to use data in a data lake to fine-tune an LLM. Consider an example where you have a history of conversations between customers and call center agents stored in your data lake. You could use these conversations to fine-tune your model to understand more about the language specific to your products. If you have product manuals stored in your data lake, you can instead retrieve facts from those manuals using the RAG pattern to help answer questions.
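As a rough illustration of the call center example, the sketch below turns a conversation into prompt/completion pairs and serializes them as JSON Lines. The `speaker` and `text` fields on each turn, and the `prompt` and `completion` keys in the output, are assumptions made for this example; the format your fine-tuning service expects may differ.

```python
import json


def conversation_to_examples(turns):
    """Pair each customer turn with the agent turn that immediately follows it."""
    examples = []
    for current, following in zip(turns, turns[1:]):
        if current["speaker"] == "customer" and following["speaker"] == "agent":
            examples.append({"prompt": current["text"], "completion": following["text"]})
    return examples


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


# Hypothetical conversation, shaped the way it might be stored in the data lake.
conversation = [
    {"speaker": "customer", "text": "My router keeps dropping the connection."},
    {"speaker": "agent", "text": "Let's start by checking the firmware version."},
    {"speaker": "customer", "text": "Where do I find that?"},
    {"speaker": "agent", "text": "Open the admin page and look under Status."},
]

training_lines = to_jsonl(conversation_to_examples(conversation))
```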

An LLM as a data producer

From another perspective, you may want to capture the outputs from your GenAI solution back into the data lake. For example, you may want to capture the questions posed to a chatbot so you can see which topics are causing the most confusion for customers. Or, you may capture operational data like the number of calls to the model and the inference latency. You can capture this information via CloudWatch, Kinesis, or other tools.
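One way to picture this capture step is a thin wrapper that times each model call and emits a structured record. The sketch below is only an illustration: the record fields, the `model` callable, and the `sink` callable are all assumptions. In practice the sink might forward each line to a stream or log that lands in the data lake.

```python
import json
import time
import uuid
from datetime import datetime, timezone


def build_interaction_record(question, answer, model_id, latency_ms):
    """Assemble one record describing a single model call (field names are illustrative)."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "question": question,
        "answer": answer,
        "latency_ms": round(latency_ms, 2),
    }


def answer_and_log(question, model, sink, model_id="example-model"):
    """Call the model, measure inference latency, and hand a JSON record to the sink."""
    start = time.perf_counter()
    answer = model(question)
    latency_ms = (time.perf_counter() - start) * 1000
    sink(json.dumps(build_interaction_record(question, answer, model_id, latency_ms)))
    return answer


# Usage with a stubbed model and an in-memory sink.
captured = []
reply = answer_and_log("How do I reset my router?", lambda q: "Hold the reset button.", captured.append)
```

Keeping the question text and the latency in the same record makes it easy to analyze both customer confusion and operational behavior from one table.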

The RAG diagram we looked at earlier also shows this pattern. The box on the right shows how we capture embeddings into our data lake and analyze them with a Glue job to detect embedding drift.
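The article does not specify how the drift job works, so the sketch below shows just one simple option as an assumption: compare the centroid of a baseline set of embeddings with the centroid of a recent set, using cosine distance. The function names are hypothetical, and a production job would likely use a more robust statistic.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline, current):
    """Distance between the centroids of two batches of embeddings."""
    return cosine_distance(centroid(baseline), centroid(current))
```

A job could compute this score for each new batch of embeddings and raise an alert when it crosses a threshold that you choose based on historical values.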

You can also use an LLM to create synthetic data used to train or fine-tune other models or feed into other business systems.

An LLM as an agent

Finally, a model can interact with the data lake through the agent/tool pattern. When a person asks a question, the LLM translates it into one or more queries that execute with a tool like Athena. The LLM then takes the query results and assembles a final answer to the question. The agent orchestration is provided by a framework like Bedrock agents or LangChain.

The flow looks like this.

The diagram below shows how to implement that example using Bedrock and Athena.

Data lake agent and tool with Bedrock and Athena

This pattern is quite useful, as it helps less technical users work with a data lake using natural language. For example, consider a data lake with product review information. You can ask questions like "How many negative reviews did we get last month?" and the LLM will work with Athena to answer that question.
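The agent flow described above (question, generated query, tool execution, final answer) can be sketched without any cloud services. In this illustration an in-memory SQLite database stands in for Athena, and the two `fake_llm_*` functions are hypothetical stubs for the model calls that a framework would orchestrate. The `reviews` table and its columns are also assumptions.

```python
import sqlite3


def fake_llm_to_sql(question):
    """Hypothetical stand-in for the LLM call that turns a question into SQL."""
    if "negative reviews" in question.lower():
        return "SELECT COUNT(*) FROM reviews WHERE sentiment = 'negative'"
    raise ValueError("this stub only knows one question")


def run_query_tool(conn, sql):
    """The tool: execute the generated query and return the rows."""
    return conn.execute(sql).fetchall()


def fake_llm_summarize(question, rows):
    """Hypothetical stand-in for the LLM call that phrases the final answer."""
    return f"Answer to '{question}': {rows[0][0]}"


def answer_question(conn, question):
    """Orchestrate the three steps: generate the query, run the tool, assemble the answer."""
    sql = fake_llm_to_sql(question)
    rows = run_query_tool(conn, sql)
    return fake_llm_summarize(question, rows)


# SQLite stands in for the query engine; the data is made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, sentiment TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, "negative"), (2, "positive"), (3, "negative")],
)
final_answer = answer_question(conn, "How many negative reviews did we get?")
```

The point of the sketch is the shape of the loop: the model never touches the data directly, and everything it learns about the data lake arrives through the tool's results.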

Using an LLM as an agent requires careful consideration of security boundaries and performance. When you have an agent acting on a person’s behalf, the agent may accidentally create an incorrect upsert statement that overwrites valid data. Or, an agent may create a poor query statement that puts too much load on a data warehouse. You should evaluate whether you need a less permissive access model when an agent is acting on behalf of a person, and carefully monitor what agents are doing and how they are affecting the data lake.
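As one example of a less permissive access model, the sketch below checks an agent-generated query before it runs: it allows a single read-only statement and appends a row limit if none is present. The keyword list, the `max_rows` default, and the function name are assumptions for illustration. A keyword check like this is crude (it will reject a harmless query whose string literal contains a word like "update"), so treat it as a complement to database permissions, not a replacement for them.

```python
import re

# Keywords that indicate a write or schema change (illustrative, not exhaustive).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant)\b",
    re.IGNORECASE,
)


def check_agent_query(sql, max_rows=1000):
    """Return a read-only, row-limited version of the query or raise PermissionError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise PermissionError("multiple statements are not allowed")
    if not re.match(r"(?i)^(select|with)\b", statement):
        raise PermissionError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise PermissionError("query contains a write or DDL keyword")
    # Cap the result size so a broad query cannot return an unbounded result set.
    if not re.search(r"(?i)\blimit\s+\d+\s*$", statement):
        statement = f"{statement} LIMIT {max_rows}"
    return statement
```

Rejected queries are also worth logging, since they show what the agent attempted and feed directly into the monitoring described above.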

You should also inspect your automated and manual testing processes. If an LLM makes a mistake, it may be quite different from the type of mistake that a person makes. Incorporate chaos engineering to test hypotheses about what an agent might do. Consider also that a business user who is not familiar with the data in your data lake might not know enough to question output that is not quite right.

Concerns and next steps

In this article, you've seen how an LLM can act as a data consumer, a data producer, and a data agent. An LLM acts as a data consumer during fine-tuning or retrieval, where access to more data increases its effectiveness. An LLM acts as a producer during the operational part of its lifecycle; you can store the evolution of prompts, responses, and operational metrics in your data lake. An LLM can also generate synthetic data to improve other systems.

Finally, an LLM can act as an agent, letting other users or programs interact with a data lake in a simpler way. This is the paradigm that requires the most inspection, as it presents different security and performance concerns. It’s worth revisiting the well-architected principles for an application that uses an agent to work with a data lake, and making sure your various types of test coverage (automated, manual, and chaos engineering) are sufficient.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
