Best practices for text-to-SQL use cases with LLMs
Get better results with smaller, off-the-shelf LLMs by improving your schema context and prompt engineering
Dave Thibault
Amazon Employee
Published Nov 5, 2024
As a generative AI solutions architect for AWS, I handle a lot of specialist requests these days. Many of them are for text-to-SQL, and I’ve seen people use many different prompt engineering techniques when trying to create text-to-SQL chat experiences, with varying degrees of success.
The solutions I’ll share here have worked for and improved the results of my customers, and led to results approaching (or at) 100% for many use cases, without requiring the largest LLMs available or specialized models. I’ll also share reasons why these techniques seem to work better, which should help build your intuition for future prompt engineering tasks.
I recently had an “experience-based acceleration” (“EBA” in AWS-speak) where I helped a commercial real estate customer add global semantic search and text-to-SQL to their primary business platform. An EBA consists of a few days on-site with a customer, preceded by six to eight weeks of weekly meetings to make sure all the resources are aligned for the on-site event.
The prep goes into making sure the customer has the right access to accounts for the event, and the ability to stand up as many resources as possible in advance. Then, on the days of the on-site event, the AWS specialists and account team members guide the customer through building the proof of concept in their own accounts. At the end, the PoC is presented to the customer’s senior stakeholders, and a path to production is mapped out.
For the text-to-SQL part of the engagement, the customer wanted to start with the “properties” and “contacts” data domains, which is a many-to-many relationship, so requires a join table.
I thought, “Three tables. No problem.”
It turns out it was 18 tables in those two domains plus the join tables between them. Still no problem? Read on… :)
Going into the on-site engagement, the customer had been working on prompt engineering with some guidance from us, but was having mixed results (around 40% correct from their 30+ test cases). It turned out they were using a narrative style of describing the tables, giving a list of the table names and descriptions at the top, some table aliases they wanted the model to use, dozens of example SELECT statements, and finally a bulleted list of each of the tables with the names of columns, their descriptions, and possible values, if applicable.
It was a lot of text; too much text for the model to keep track of. Granted, you can have huge context windows of 200K tokens, but that doesn’t mean you should use them, in terms of cost or performance. When I read through it, I kept having to go back and re-read earlier parts to remind myself what they said, because it was long and packed with descriptive info.
Here’s a simple prompt engineering test: give your prompt to a friend and watch the look on their face as they read it. If their eyes keep needing to go back up to re-read, that’s a problem.
Here’s another test: would you expect a well-meaning but sometimes-overly-literal intern to be able to follow the instructions? If not, simplify.
If you think about the training data that gives these LLMs the ability to write SQL in the first place, it’s SQL documentation. In SQL documentation, tables are described with CREATE TABLE statements. Since that’s the most common format the models have seen, it will work best to describe your tables. It’s also a lot shorter than alternative descriptive formats, so it will be faster and cheaper to run, in addition to being more effective.
This advice generalizes to other topics as well. Think about the training data the models have seen for a special topic, and try to emulate that instead of a different format, whenever applicable.
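For example, a table description in that style might look like the sketch below. These are simplified, hypothetical property/contact tables for illustration, not the customer’s actual schema; the SQL comments, CHECK constraints, and inline REFERENCES conventions it uses are covered in the checklist that follows.

-- Properties owned or managed by the firm
CREATE TABLE properties (
  property_id   INT PRIMARY KEY,
  property_name VARCHAR(255),
  -- Gross building area in square feet
  building_area_sqft INT,
  property_type VARCHAR(20) CHECK (property_type IN ('office', 'retail', 'industrial'))
);

CREATE TABLE contacts (
  contact_id INT PRIMARY KEY,
  full_name  VARCHAR(255),
  email      VARCHAR(255)
);

-- Join table for the many-to-many relationship between properties and contacts
CREATE TABLE property_contacts (
  property_id  INT REFERENCES properties (property_id),
  contact_id   INT REFERENCES contacts (contact_id),
  -- Role this contact plays for the property
  contact_role VARCHAR(20) CHECK (contact_role IN ('owner', 'broker', 'tenant'))
);

Notice how compact this is compared to a narrative description of the same information.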
Here’s what’s worked best for me and my customers.
- Start from scratch with a blank document. Don’t try to change a big prompt you’ve been working on. The existing complexity might be part of the problem. Simple is always best for prompt engineering. Save it for future reference, though.
- Dump the database schema with a tool like mysqldump (for example, with its --no-data option) so you get the CREATE TABLE statements for all required tables. Don’t dump the data rows.
- If your database dump uses ALTER TABLE statements for the foreign key descriptions, convert them to the REFERENCES table_name (column) format inside the CREATE TABLE statements instead. Then delete everything except the CREATE TABLE statements from the DDL.
- If any columns or tables are not named in a self-explanatory way, add SQL comments above them to describe them. Not every column or table, only the non-obvious ones.
- If a column has a limited set of values, use a CHECK constraint inside the CREATE TABLE statement rather than some other narrative format.
- DO NOT add any example queries until you need them, and you don’t need them until you identify categories of errors in the model responses. Then decide how best to fix each error category. Maybe that’s an example for a complex join, but maybe it’s just fixing or simplifying the prompt a different way. If you do need an example, start by adding a single one that addresses the category of failures, and see if that alone eliminates those categorical errors. The fewer the better.
- If you have a large number of tables, don’t just include all the table descriptions in every LLM invocation. Instead, group together the tables that need to be joined for certain types of user queries. This ensures you’re not missing critical tables for a join at inference time, and that you fetch only the top few schema chunks from the vector database. As a corollary, if you want accuracy in a gen AI workload, use retrieval augmented generation (RAG). By themselves, these models work the same way the next-word prediction on your phone does. It’s word association, not fact memorization. If you need facts, look them up and let the model read through them to find the answer: use RAG.
- Put each group of tables into its own document, and ingest those documents into your vector database without further chunking. This means that if you have large groups of tables, you may need an embedding model with a large context window; 300 or 500 tokens may not be enough for dozens of tables. I like Titan Embeddings v2 on AWS, which offers a context window of up to 8,192 tokens. If I’m using Bedrock Knowledge Bases for text-to-SQL context ingestion, I select the no-chunking option so the chunks stay exactly the way I intended, and I just keep each group of tables under 8,192 tokens per chunk.
- At inference time, you should be able to pull back just the top few schema chunks from the vector database that match the user’s incoming query (maybe even just the top one), and they should contain the right tables to answer the user’s question. If you’re having a hard time getting users’ questions to match the right chunks of database schema, add SQL comments to the top of each chunk that match the questions you expect to go against those tables. Keep that commentary minimal.
- When you bring back multiple chunks, they may list the same table more than once. That’s OK, but it’s better to remove the duplicates. If you want to do so (I’m including this extra detail because someone asked me), concatenate the chunks together, then split on the characters between the CREATE TABLE statements. For example, splitting on the “);” at the end of each CREATE TABLE statement gives you an array of statements missing those last two characters; some languages let you keep the separators, and in Python you can simply append them back onto each string. Then trim off anything before the “CREATE TABLE” in each piece (there might be comments). Now you have a list of CREATE TABLE statements. Compare only the “CREATE TABLE table_name” part, because the comments might differ across text chunks. Eliminate any duplicates (optionally keeping the comments from the eliminated copy), then rejoin the list into a string of unique CREATE TABLE statements, plus any comments you want to keep, for the final model invocation. A minimal Python sketch of the retrieval and dedup steps appears after this list.
- Always separate long context blocks from the rest of the prompt with XML tags, like:
<SQL_SCHEMAS>
(insert all your schema chunks here)
</SQL_SCHEMAS>
It’s less critical when the content is obviously distinct, like CREATE TABLE statements, but when the context is text from a document, tags help the model recognize where the context starts and stops and where the rest of the instructions in the template are, outside of those variable context sections. Also, models like Anthropic Claude were heavily trained with XML tags like this, so using them to distinguish between the prompt template and the variable context is helpful. No need to overdo it everywhere in the prompt template, though; that just adds complexity.
- The biggest rule in prompt engineering is to keep it as simple as possible: start with small successes and layer functionality into your prompt template. Don’t start with a giant prompt that asks the model to do a whole list of things right from the start. Incremental prompt engineering makes it easier to develop your intuition of what works and how to explain things to the model in as simple a manner as possible. Don’t preemptively assume it will need lengthy instructions until you’ve tried simpler instructions first.
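Here is a minimal Python sketch of the retrieval and dedup steps described in the list above. It assumes the schema chunks live in a Bedrock Knowledge Base (the knowledge base ID is a placeholder), and the split-on-“);” parsing is the simple heuristic described earlier; real chunks may need more careful handling.

import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime")

def get_schema_context(question, kb_id, top_k=3):
    # Pull the top few schema chunks that best match the user's question
    response = bedrock_agent.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    chunks = [result["content"]["text"] for result in response["retrievalResults"]]

    # Split the concatenated chunks into individual CREATE TABLE statements,
    # re-appending the ");" that the split removes
    pieces = "\n".join(chunks).split(");")
    statements = [piece.strip() + "\n);" for piece in pieces if piece.strip()]

    # Deduplicate on the "CREATE TABLE table_name" line only,
    # because the comments may differ across chunks
    unique_statements, seen = [], set()
    for statement in statements:
        create_line = next(
            (line for line in statement.splitlines()
             if line.upper().lstrip().startswith("CREATE TABLE")),
            statement,
        )
        key = " ".join(create_line.upper().split())
        if key not in seen:
            seen.add(key)
            unique_statements.append(statement)

    return "\n\n".join(unique_statements)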
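Putting those pieces together, a prompt template following this pattern might look like the sketch below. The wording and placeholders are illustrative, not the exact prompt from the engagement.

You are an expert SQL developer. Using only the tables described in the schemas below, write a single SQL statement that answers the user's question. Do not include any explanation or narrative.

<SQL_SCHEMAS>
(insert the retrieved schema chunks here)
</SQL_SCHEMAS>

<QUESTION>
(insert the user's question here)
</QUESTION>

<SQL_STATEMENT>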
Leaving that last opening <SQL_STATEMENT> tag without a closing tag is called "putting words in Claude's mouth" (or any model's mouth, I suppose), and it helps reinforce the earlier instruction to avoid additional narrative. Then, if your model accepts stop sequences, like Claude does, you can use </SQL_STATEMENT> as a stop sequence.
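Here's a minimal sketch of passing that stop sequence programmatically through the Bedrock runtime API, assuming the Anthropic Messages request format for Claude 3 Haiku; adjust the model ID and request body for whichever model you use.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_sql(populated_prompt):
    # Invoke Claude 3 Haiku with </SQL_STATEMENT> as a stop sequence,
    # so generation halts once the SQL statement is complete
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "stop_sequences": ["</SQL_STATEMENT>"],
        "messages": [{"role": "user", "content": populated_prompt}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"].strip()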
If I populate a template like that and run it in the Amazon Bedrock Console, the model responds with the SQL statement followed by the closing </SQL_STATEMENT> tag. I can then remove the </SQL_STATEMENT> text and run the statement. In one example I ran this way, the model successfully completed a five-table join across the six tables provided in the schema context, without any additional narrative description of the tables or relationships, just the CREATE TABLE DDL. It also left a <VENDOR_ID_CONDITION> placeholder tag in the output for convenient programmatic substitution, since I hadn't specified the vendor selection criteria to begin with.

One last note: you can also create simplified (possibly denormalized) views of multiple database tables to reduce complexity further for the LLM, shrink the input token count, and thereby reduce costs and improve performance and latency. But don't over-engineer from the beginning. Get it working like this, speed time to market with reasonable COGS ("cost of goods sold"), and start getting ROI faster, while leaving yourself a future opportunity for marginal cost optimization that leads to increased profits, lower prices for your customers, or both.
Using the process above, we were able to get the customer to 100% accuracy across their 30+ test cases by using all 18 tables for property and contact domains in a single chunk. To scale further, they plan on adding additional data domains per vector database schema chunk.
How did we measure the results? We used my friend Justin Muller's great Medium article, "Prompt Evaluation: Systematically testing and improving your Generative AI prompts at scale. Code included!" I refer a LOT of customers to that article. It provides an example test harness for unit testing your prompts, and I highly recommend it to everyone who intends to put generative AI applications into production.
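That article includes a full test harness; purely as a rough illustration of the idea, a minimal evaluation loop over your test cases might look like the sketch below. It reuses the get_schema_context and generate_sql sketches from earlier, assumes a hypothetical build_prompt helper that fills in the prompt template, and uses exact string matching, which is the crudest possible check; comparing the results of executing the queries is usually more robust.

import re

def normalize(sql):
    # Collapse whitespace and case so formatting differences don't count as failures
    return re.sub(r"\s+", " ", sql.strip().lower())

def evaluate(test_cases, kb_id):
    # test_cases: list of (question, expected_sql) pairs
    passed = 0
    for question, expected_sql in test_cases:
        schema_context = get_schema_context(question, kb_id)
        generated_sql = generate_sql(build_prompt(schema_context, question))
        if normalize(generated_sql) == normalize(expected_sql):
            passed += 1
        else:
            print(f"FAILED: {question}\n  got: {generated_sql}")
    print(f"{passed}/{len(test_cases)} test cases passed")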
I’ve been using Claude 3 Haiku almost exclusively since it launched earlier in 2024, but this strategy should work well with other models too (like Llama 3.2, the Mistral models, or other models in Bedrock), because of the reasoning shared above with respect to SQL training data. With the process above, you don’t need the biggest, most expensive models. You can use the smallest of the best modern models and get great results, if you keep it simple by using SQL DDL to describe your database schemas. In doing so, you also get cost and performance (latency) optimization.
Let me know what you think! I’d love to hear if this helps people, or if you find a way to improve on it. I’m always eager to get feedback, to confirm and/or improve my future advice for the benefit of customers.
Thanks for reading!
Dave Thibault
AWS Generative AI Sr. Solutions Architect, North America
https://linkedin.com/in/davetbo
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.