Enhancing Query Generation for LangChain

Answering more questions from "Exploration of Prompting Strategies for GraphRAG applications" by enhancing code found in LangChain.

Brian O'Keefe (AWS)
Amazon Employee
Published May 12, 2025
At The Knowledge Graph Conference 2025, I delivered a new workshop I had written titled "Exploration of Prompting Strategies for GraphRAG applications". That workshop included a challenge: improve our prompts for graph extraction and query generation so that three questions the baseline did not answer sufficiently could be answered. As promised to attendees, here is one of those methods.
First, some background. Graph query generation using LLMs generally works by giving the LLM the question you want to answer along with the schema of the graph; the LLM then generates a query to answer the question. The code that generates this schema from Amazon Neptune when using LangChain can be found in neptune_graph.py. In the workshop, I modified the core functionality to run in a Jupyter notebook cell instead of inside the framework, and it produced the following (shortened for brevity):
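The exact string depends on your graph. As a shortened, hypothetical illustration (the labels, properties, and relationships below are assumptions loosely modeled on the workshop scenario, not the actual output), the schema LangChain builds looks something like:

```
Node properties are the following:
[{'properties': [{'property': 'name', 'type': 'STRING'},
                 {'property': 'subtype', 'type': 'STRING'}], 'labels': 'BUSINESS'},
 {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'STATION'}]
Relationship properties are the following:
[]
The relationships are the following:
['(:`STATION`)-[:`ROUTE`]->(:`STATION`)', '(:`BUSINESS`)-[:`NEAR`]->(:`STATION`)']
```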
The gap here is that when you ask a question where the generated query requires a WHERE clause or filter on categorical data, the LLM doesn't know which values to filter on. Consider these two questions from the workshop:
"I am currently in Greenwich Village. Is there a recommended restaurant nearby to eat at?"
"I'm currently at St Patrick's Cathedral and I'm hungry and I want to shop within several stops of there. Where should I go?"
In both of these scenarios, the LLM generates a query that returns no answers, partly because of a modeling decision we made. The respective queries that one run through the workshop generated are:
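The exact text varies from run to run; queries of roughly this shape (the labels and values here are illustrative, not the verbatim workshop output) show the failure mode:

```
// Question 1: the subtype value is guessed in the wrong case
MATCH (b:BUSINESS)-[:NEAR]->(s:STATION {name: 'Greenwich Village'})
WHERE b.subtype = 'Restaurant'
RETURN b.name

// Question 2: the guessed subtype values do not exist in the data
MATCH (b:BUSINESS)-[:NEAR]->(s:STATION)
WHERE b.subtype IN ['Shopping Mall', 'Retail Store']
RETURN b.name, s.name
```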
In the first query, the LLM actually guessed the subtype correctly, but we modeled our subtype values as uppercase. The generated query was not case-insensitive, so no results were found. In the second query, most of the guessed subtypes do not exist in the data.
Before I share the modification that improves results, let me stress this point: this code is provided as a sample for demonstration purposes and is not tested for production environments. In fact, I can say with some certainty that adding it to your query generation code will not scale for large graphs with many nodes and a large number of distinct labels and properties. That said, the solution is to include the potential values for categorical data in the schema. So the schema snippet for label: BUSINESS
becomes
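As a hypothetical illustration (the subtype values shown are assumptions, not the workshop's actual data), the before and after look like:

```
# before
{'properties': [{'property': 'name', 'type': 'STRING'},
                {'property': 'subtype', 'type': 'STRING'}], 'labels': 'BUSINESS'}

# after: categorical STRING properties now list their distinct values
{'properties': [{'property': 'name', 'type': 'STRING'},
                {'property': 'subtype', 'type': 'STRING',
                 'values': ['RESTAURANT', 'GROCERY', 'PHARMACY']}], 'labels': 'BUSINESS'}
```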
and we modify the CYPHER_GENERATION_PROMPT variable from
to
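As a sketch of the kind of change involved (the template text below only approximates LangChain's stock cypher-generation prompt, and the appended instruction is one possible wording, not the workshop's exact text):

```python
# Sketch of the prompt change. LangChain wraps a template like this in a
# PromptTemplate with {schema} and {question} placeholders; the sentence
# about property values is the new addition.
CYPHER_GENERATION_TEMPLATE = """Task: Generate an openCypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
When a property lists its possible values in the schema, filter using only
those exact values, matching case exactly.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
The question is:
{question}"""

# Rendering the template is plain string formatting:
prompt = CYPHER_GENERATION_TEMPLATE.format(
    schema="(schema goes here)", question="(question goes here)"
)
```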
In my example, I used an arbitrary cutoff: categorical data exists if a property is of type "STRING" with 10 or fewer distinct values across the nodes with that label. To generate this schema, I added the following functions
and modified _get_node_properties to utilize those functions
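The real method in neptune_graph.py does more than this; the toy sketch below (stubbed query callable, hypothetical result shapes) only shows the shape of the change:

```python
CATEGORICAL_CUTOFF = 10


class SchemaBuilder:
    """Toy stand-in for the NeptuneGraph schema code (not the real class)."""

    def __init__(self, query):
        self.query = query  # callable taking an openCypher string

    def _get_node_properties(self, labels):
        node_properties = []
        for label in labels:
            props = []
            # Hypothetical property-scan result: [{'property': 'subtype', 'type': 'STRING'}]
            for row in self.query(f"/* property scan for {label} */"):
                entry = {"property": row["property"], "type": row["type"]}
                # New part: attach distinct values for categorical STRING properties
                if row["type"] == "STRING":
                    values = sorted(
                        r["value"]
                        for r in self.query(
                            f"MATCH (n:`{label}`) RETURN DISTINCT "
                            f"n.`{row['property']}` AS value"
                        )
                        if r["value"] is not None
                    )
                    if 0 < len(values) <= CATEGORICAL_CUTOFF:
                        entry["values"] = values
                props.append(entry)
            node_properties.append({"labels": label, "properties": props})
        return node_properties
```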
With this additional information in the schema, the LLM now generates proper queries that result in answers, such as:
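An illustrative example (not the verbatim workshop output) of such a query, now filtering on an exact value taken from the schema:

```
MATCH (b:BUSINESS)-[:NEAR]->(s:STATION {name: 'Greenwich Village'})
WHERE b.subtype = 'RESTAURANT'
RETURN b.name
```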
Importantly, the LLM no longer has to guess which values are appropriate for filtering. The values are known because the LLM is told what they are.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
