
Best practices for text-to-SQL use cases with LLMs
Get better results with smaller, off-the-shelf LLMs by improving your schema context and prompt engineering
- Start from scratch with a blank document. Don’t try to change a big prompt you’ve been working on. The existing complexity might be part of the problem. Simple is always best for prompt engineering. Save it for future reference, though.
- Dump the database schema with a tool like mysqldump, which will list the CREATE TABLE statements describing all required tables. Don’t dump the data rows.
- If your dump describes foreign keys with separate ALTER TABLE statements, convert those foreign key descriptions to the references table_name (column) format inside the CREATE TABLE statements instead. Then delete everything except the CREATE TABLE statements from the DDL.
- If there are any columns or tables that are not named in a self-explanatory way, add SQL comments above the columns or tables to describe them. Not every column or table…only the non-obvious ones.
- If a column has a limited set of values, list them with a CHECK constraint inside the CREATE TABLE statement instead of in some other narrative format.
- DO NOT add any example queries until you need them. You don’t need them until you identify categories of errors that are arising from the model responses. Then you need to decide how best to fix that error categorically. Maybe it’s by an example for a complex join, but maybe it’s just from fixing or simplifying the prompt a different way. If you need an example, start by adding one example that addresses the category of failures, and see if you can eliminate those categorical errors with a single example. The fewer the better.
- If you have a large number of tables, don’t just use all the table descriptions in every LLM invocation. Instead, group the tables together that you need joined together for certain types of user queries. This ensures you’re not missing critical tables for the join at inference time and you fetch only the top few schema chunks from the vector database. As a corollary, if you want accuracy in a gen AI workload, use retrieval augmented generation (RAG). By themselves, these models work the same way the next word prediction on your phone does. It’s word association, not fact memorization. If you need facts, look them up and let the model read through them to find the answer: use RAG.
- Put each group of tables into its own document, and ingest those documents into your vector database without further chunking. This means if you have large groups of tables, you may need an embedding model with a large context window; 300 or 500 tokens may not be enough for dozens of tables. I like Titan Embeddings v2 on AWS, which offers a context window of up to 8,192 tokens. If using Bedrock Knowledge Bases, I select the no-chunking option when doing text-to-SQL context ingestion, so I can be sure the chunks stay the way I intended, and I just keep each group of tables under 8,192 tokens per chunk.
- At inference time, you should be able to pull back just the top few schema chunks from the vector database that match the user’s incoming query (or maybe even just the top one), and they should contain the right tables needed to answer the user’s question. If you’re having a hard time getting the user’s questions to match the right chunks of database schema, add SQL comments to the top of each chunk matching the questions you expect to go against those tables. Keep that commentary minimal.
- When you bring back multiple chunks, they may have the same table listed more than once. That’s OK, but it would be optimal to remove those duplicates. If you want to do so (I’m including this extra detail because someone asked me), concatenate them all together, then split on the characters in between the create table statements. For example, splitting on the “);” at the end of a create table statement would result in an array of create table statements that were missing those last two characters. Some languages allow you to keep the separators. For Python, just append them back to the end of the string. Then chop anything off before the “create table” statement in each split (there might be comments). Now you have a list of create table statements. Just compare the create table table_name part, because the comments might be different across text chunks. Eliminate any duplicates and optionally keep the comments from the eliminated one in the final list. Rejoin the list into a string of unique create table statements, plus any comments you might want to keep for the final model invocation.
- Always separate long context blocks from the rest of the code by XML tags, like:
<SQL_SCHEMAS>
(insert all your schema chunks)
</SQL_SCHEMAS>
It’s less critical when it’s obvious like CREATE TABLE statements, but when it’s text from a document, this helps the model see where the context starts and stops and where the rest of the instructions in the template are, outside of those variable context sections. Also, models like Anthropic Claude were heavily trained with XML tags like this, so using them to distinguish between the prompt template and the variable context is helpful. No need to overdo it everywhere in the prompt template, though; that just adds complexity.
- The biggest rule when prompt engineering is to keep it as simple as possible: start with small successes and layer functionality into your prompt template. Don’t start with a giant prompt where you ask the model to do a whole list of things right from the start. Incremental prompt engineering makes it easier to develop your intuition of what works and how to explain things to the model as simply as possible. Don’t preemptively assume it will need lengthy instructions until you’ve tried simpler instructions first.
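For the schema-only dump step above, mysqldump’s --no-data flag emits just the DDL without the row data. A minimal sketch (database name and user are placeholders):

```python
def schema_dump_command(database: str, user: str = "root") -> list:
    """Build a schema-only mysqldump command.

    --no-data skips the INSERT rows, so the output is just the DDL.
    """
    return ["mysqldump", "--no-data", f"--user={user}", database]

# To capture the DDL (requires a reachable MySQL server):
# import subprocess
# ddl = subprocess.run(schema_dump_command("shop"),
#                      capture_output=True, text=True).stdout
```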
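For the CHECK constraint step, here’s a quick way to see the idea in action with SQLite (table and values are illustrative; note MySQL only enforces CHECK constraints from 8.0.16 onward):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER,
        -- the allowed statuses live in the DDL itself, where the model can see them
        status TEXT CHECK (status IN ('pending', 'shipped', 'delivered'))
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'shipped')")   # passes the CHECK
try:
    conn.execute("INSERT INTO orders VALUES (2, 'lost')")  # fails the CHECK
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The allowed values travel with the schema context, so the model sees them in exactly the place it already expects column definitions.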
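For the chunk-size concern, a rough pre-ingestion check helps. This uses the common ~4-characters-per-token heuristic (an approximation, not the real tokenizer):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text and SQL.
    return len(text) // 4

def oversized_groups(groups: dict, limit: int = 8192) -> list:
    """Return the table-group names whose DDL likely exceeds the embedding
    model's context window (8,192 tokens for Titan Embeddings v2)."""
    return [name for name, ddl in groups.items() if approx_tokens(ddl) > limit]
```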
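At inference time with Bedrock Knowledge Bases, the top schema chunks come back from the Retrieve API. A sketch of the request (the knowledge base ID and question are placeholders; field names follow the boto3 bedrock-agent-runtime retrieve call as I understand it):

```python
def build_retrieve_request(question: str, kb_id: str, top_k: int = 3) -> dict:
    """Request arguments for bedrock-agent-runtime's retrieve() call,
    limiting the result set to the top few schema chunks."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

# With AWS credentials configured:
# import boto3
# client = boto3.client("bedrock-agent-runtime")
# resp = client.retrieve(**build_retrieve_request(
#     "Which states has vendor X shipped to?", "KBID1234"))
# chunks = [r["content"]["text"] for r in resp["retrievalResults"]]
```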
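The dedup recipe above can be sketched in Python like this (it drops the comments before each statement rather than keeping them, and assumes unquoted table names with no ");" inside a statement):

```python
import re

def dedupe_schema_chunks(chunks: list) -> str:
    """Merge retrieved schema chunks and drop duplicate CREATE TABLE statements.

    Split on the ");" that ends each statement, re-append the separator,
    chop off anything before CREATE TABLE, then keep the first statement
    seen for each table name.
    """
    statements = []
    for part in "\n".join(chunks).split(");"):
        idx = part.upper().find("CREATE TABLE")
        if idx == -1:
            continue  # blank trailing piece, or comments with no DDL
        statements.append(part[idx:] + ");")
    seen, unique = set(), []
    for stmt in statements:
        name = re.match(r"CREATE TABLE\s+(\w+)", stmt, re.IGNORECASE).group(1).lower()
        if name not in seen:
            seen.add(name)
            unique.append(stmt)
    return "\n".join(unique)
```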
You're an expert at SQL. You will be given a user's natural language prompt, and a set of table definitions.
Here's the user's prompt:
<USER_PROMPT>
Please list all of the customers' states that a specific vendor has shipped their products to.
</USER_PROMPT>
<SQL_SCHEMA>
-- This is a table schema to be used when a user prompts to generate SQL select queries.
-- The users will not expect you to return data, just the SQL queries that they can review and later execute.
CREATE TABLE customers (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
phone VARCHAR(20),
street_address VARCHAR(100),
city VARCHAR(50),
state VARCHAR(2),
zip VARCHAR(10),
country VARCHAR(50)
);
CREATE TABLE vendors (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
business_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
street_address VARCHAR(100),
city VARCHAR(50),
state VARCHAR(2),
zip VARCHAR(10),
country VARCHAR(50)
);
CREATE TABLE product_categories (
name VARCHAR(255),
example_goods VARCHAR(255)
);
CREATE TABLE products (
id INTEGER,
name TEXT,
product_category TEXT references product_categories(name),
description TEXT,
price REAL,
vendor_id INTEGER references vendors(id)
);
CREATE TABLE orders (
id INT,
customer_id INT references customers(id),
order_date DATE,
delivery_date DATE
);
CREATE TABLE order_items (
id INT,
order_id INT references orders(id),
product_id INT references products(id),
line_item_price FLOAT,
line_item_discount FLOAT,
quantity INT
);
</SQL_SCHEMA>
Please return your response as a SQL SELECT statement without additional narrative before or after.
<SQL_STATEMENT>

Ending the prompt with an opening <SQL_STATEMENT> tag without a closing tag is called "putting words in Claude's mouth" (or any model's mouth, I suppose), and it helps reinforce the previous instruction to avoid additional narrative. Then, if your model accepts stop sequences like Claude does, you can use </SQL_STATEMENT> as a stop sequence.
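With the Anthropic messages format on Amazon Bedrock, that prefill-plus-stop-sequence setup looks roughly like this (a sketch; the prompt text and model ID are placeholders, and the payload details may vary by model version):

```python
prompt_text = "(the full prompt template shown above)"  # placeholder

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    # The assistant turn is prefilled with the opening tag ("putting words
    # in Claude's mouth"), and the closing tag is the stop sequence.
    "messages": [
        {"role": "user", "content": prompt_text},
        {"role": "assistant", "content": "<SQL_STATEMENT>"},
    ],
    "stop_sequences": ["</SQL_STATEMENT>"],
}

# With AWS credentials configured:
# import json, boto3
# bedrock = boto3.client("bedrock-runtime")
# resp = bedrock.invoke_model(modelId="anthropic.claude-3-haiku-20240307-v1:0",
#                             body=json.dumps(body))
```

The completion then stops right at the closing tag, so what comes back is just the SQL.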
SELECT DISTINCT c.state
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN products p ON oi.product_id = p.id
INNER JOIN vendors v ON p.vendor_id = v.id
WHERE v.id = (SELECT id FROM vendors WHERE <VENDOR_ID_CONDITION>);
</SQL_STATEMENT>
From there, your code can just take everything before the closing </SQL_STATEMENT> text and run the statement. Note that the model successfully completes a five-table join out of the six tables provided in the database schema example above, without any additional narrative description of the tables or relationships, just the CREATE TABLE DDL. It also left the <VENDOR_ID_CONDITION> tag for convenient programmatic substitution, since I didn't specify the vendor selection criteria to begin with.

AWS Generative AI Sr. Solutions Architect, North America
https://linkedin.com/in/davetbo
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.