How to get structured output from LLMs - A practical guide
Integrating LLMs into production systems requires structured, reliable output. In this post, you’ll learn three practical approaches to enforce it: Prompt Engineering, Function Calling, and Output Validation.
Katja
Amazon Employee
Published May 13, 2025
Working with Large Language Models (LLMs) has become business as usual - but integrating them into reliable, production-grade systems remains challenging. A crucial aspect of this integration is ensuring that LLM output adheres to a predefined structure - particularly critical for backend processes like data pipelines, API orchestration, and multi-agent systems.
Raw LLM outputs present several challenges: they may be inconsistent, get cut off unexpectedly, vary between calls, or deviate from the expected format. Having worked on Intelligent Document Processing (IDP) projects, I've seen how crucial it is to obtain responses in specific schemas. Beyond IDP, structured outputs are essential for:
- Data Extraction & Integration - Converting unstructured content into standardized formats for databases, APIs, and automated workflows
- Multi-Agent Systems - Enabling reliable communication between AI agents through well-defined message formats
- User Interface Integration - Ensuring LLM outputs can be consistently parsed and displayed in front-end applications
In this post, we’ll explore practical strategies for generating structured outputs from LLMs - from simple prompt engineering to advanced frameworks that combine tool use and schema validation (e.g., Pydantic). We'll examine the trade-offs of each approach and when to use them.
There are several approaches to obtaining structured output from LLMs, each with its own advantages and use cases.
- Prompt Engineering is the simplest and most naive approach, requiring no additional tooling. It works well for straightforward, consistent schemas but is more prone to errors and inconsistencies. This method is best suited for rapid prototyping and simple applications where reliability isn't critical.
- Function Calling/Tool Use represents a more robust approach by providing the model with a predefined API schema. It's valuable for complex applications and multi-step processes where precision is essential.
- Output Parsers and Validators, such as Pydantic, provide schema validation against a predefined schema, helping catch and handle malformed outputs. They enable easier integration with existing code bases and are ideal for production systems where reliability is paramount.
Remember that these methods are not mutually exclusive - they can, and often should, be used in combination.
Just as important as how we structure the output is what form it takes. Common output formats include:
- JSON - standard format for API integrations and complex nested structures.
- XML - important for enterprise systems and legacy integrations.
- Key-value pairs - ideal for simple, flat data structures, configuration settings, and scenarios where data maps directly to table columns or UI components.
The choice (& combination) of format and method depends on your use case, integration requirements, data complexity, and existing system constraints. In the following section, we’ll dive deeper into each of the methods.
The most straightforward approach to getting structured output from an LLM is through careful prompt engineering. While this method might seem basic, it can be effective when done right. There are three main techniques to consider:
- Clear format specification
- Response prefilling
- Few-shot learning (examples)
1/ Clear format specification
Instructing the model to return output in a specific format - typically JSON - is the most direct (& naive) technique. For example:
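A minimal sketch for a hypothetical invoice-extraction task (the field names and wording are my own illustration, and `document_text` stands in for the raw document):

```python
document_text = "..."  # raw invoice text from your upstream pipeline (placeholder)

# Illustrative prompt - the field names are assumptions for this example.
prompt = f"""Extract the key fields from the invoice below and return them as JSON
with the keys "invoice_number", "invoice_date", "vendor_name" and "total_amount".

<invoice>
{document_text}
</invoice>"""
```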
This works but can be unreliable. Common issues include incomplete JSON structures, added explanations outside the JSON, or deviations from the specified format.
To improve reliability, you can be even more explicit:
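A stricter version could look like this (again an illustrative sketch, not a canonical template):

```python
prompt = f"""Extract the key fields from the invoice below.

Respond with ONLY a single valid JSON object - no explanations, no markdown fences.
Use exactly these keys: "invoice_number", "invoice_date", "vendor_name", "total_amount".
If a field cannot be found, set its value to null.

<invoice>
{document_text}
</invoice>"""
```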
2/ Response prefilling
Another option is to prefill the model’s expected response. Different LLMs handle response prefilling (or 'prompt templates') in varying ways - it could look like:
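For a messages-style API such as Anthropic's, a sketch could prefill the assistant turn with the opening brace (reusing the `prompt` string from above):

```python
# Messages-style request where the assistant turn is prefilled with "{",
# so the model continues the JSON object instead of wrapping it in prose.
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": "{"},  # prefilled opening brace
]
```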
or:
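For a completion-style prompt template, a sketch could end the prompt with the beginning of the expected answer:

```python
# Completion-style template where the start of the answer is prefilled inline;
# the doubled braces escape a literal "{" inside the f-string.
prompt_template = f"""Extract the invoice fields from the document below as JSON.

<invoice>
{document_text}
</invoice>

JSON:
{{"""
```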
Similarly, if you are working with Amazon Nova models, steer the model to respond in JSON by prefilling the assistant message with ```` ```json ```` and adding a stop sequence on ```` ``` ````. By partially supplying the response format, you bias the model toward structured completion. The exact syntax varies by provider and model, so it's important to test how your chosen model handles response prefilling.
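A sketch of how that could look with boto3 and the Bedrock Converse API (the model ID is illustrative):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # illustrative model ID
    messages=[
        {"role": "user", "content": [{"text": prompt}]},
        # Prefill the assistant turn so the model continues inside a JSON code block.
        {"role": "assistant", "content": [{"text": "```json"}]},
    ],
    inferenceConfig={"stopSequences": ["```"]},  # stop once the code block closes
)

raw_json = response["output"]["message"]["content"][0]["text"]
invoice_data = json.loads(raw_json)
```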
3/ Few-shot learning
Providing one or more examples to the model (few-shot learning) helps improve output consistency. Examples show the model exactly what kind of structure and content you expect. This is especially useful for more ambiguous or variable documents.
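A compact one-shot sketch - the example invoice values are invented purely for illustration:

```python
# One-shot example embedded in the prompt - the sample values are made up for illustration.
prompt = f"""Extract the invoice fields as JSON.

Example input:
Invoice INV-0042 issued 2024-03-01 by Acme Corp, total due 1250.00 EUR.

Example output:
{{"invoice_number": "INV-0042", "invoice_date": "2024-03-01", "vendor_name": "Acme Corp", "total_amount": 1250.00, "currency": "EUR"}}

Now extract from this invoice:
{document_text}

Output:"""
```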
For Anthropic Claude 3 users, check out the prompt engineering workshop for additional best practices.
While prompt engineering alone can go a long way, you can pair it with libraries that repair or enforce structure. One example is json_repair, a Python library that fixes common JSON syntax errors like unclosed brackets or missing commas.
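For instance, a minimal sketch assuming the `json-repair` package is installed:

```python
import json
from json_repair import repair_json

# LLM output with a common defect: the closing brace is missing.
llm_output = '{"invoice_number": "INV-0042", "total_amount": 1250.00'

fixed = repair_json(llm_output)  # returns a syntactically valid JSON string
data = json.loads(fixed)
```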
While prompt engineering is the simplest approach, it's important to note its limitations. Without additional validation, you might need to handle edge cases and errors in your application code. For production systems, consider combining this approach with output validators or moving to more robust methods like function calling.
Function calling offers a more robust approach to extract structured output from LLMs. Unlike prompt-only approaches, this method defines a strict schema that the model must conform to. This approach is particularly powerful for complex applications where reliability is crucial.
Note: Function calling does not mean the model executes code or calls an API directly. Instead, it returns a structured payload based on the defined schema, which your application can then use as input to a real function, API call, or downstream system.
Let’s stay with our invoice extraction example - below is an example tool definition using Anthropic’s Claude models and the Amazon Bedrock Converse API (which makes it easy to plug in different models supported by Amazon Bedrock):
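A sketch of such a tool definition - the tool name, fields, and enum values are my own illustration, not a fixed schema:

```python
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "extract_invoice",  # illustrative tool name
                "description": "Extract structured invoice data from a document.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "invoice_number": {"type": "string", "description": "Invoice identifier"},
                            "invoice_date": {"type": "string", "description": "Issue date in YYYY-MM-DD format"},
                            "vendor_name": {"type": "string", "description": "Name of the issuing vendor"},
                            "total_amount": {"type": "number", "description": "Total amount due"},
                            "currency": {"type": "string", "enum": ["EUR", "USD", "GBP"]},
                        },
                        "required": ["invoice_number", "total_amount"],
                    }
                },
            }
        }
    ],
    # Force the model to call this tool; omit toolChoice to let the model decide.
    "toolChoice": {"tool": {"name": "extract_invoice"}},
}
```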
This schema-based definition allows for rich, nested structures, type enforcement, and field-level descriptions. You can also define enums, constrain value types, and set required fields. The `toolChoice` parameter (called `tool_choice` in Anthropic's native API) forces the model to use this specific tool - if omitted, the model will decide whether to use it based on its reasoning. When invoking an LLM, you would use the tool as follows:
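A minimal sketch with boto3 (the model ID is illustrative and `document_text` is assumed from earlier):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [{"text": f"Extract the invoice data:\n\n{document_text}"}]}],
    toolConfig=tool_config,
)

# Pull the structured arguments out of the toolUse block in the response.
tool_use = next(
    block["toolUse"]
    for block in response["output"]["message"]["content"]
    if "toolUse" in block
)
invoice_data = tool_use["input"]  # a dict conforming to the schema above
```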
This results in a structured JSON payload that aligns with the schema, which you can then feed into other parts of your system—such as a billing API or a database.
While function calling requires more setup than basic prompt engineering, it provides much better reliability and maintainability for production systems.
While prompt engineering and function calling can produce structured output, additional parsing and validation can enhance reliability and integration.
One of the most popular tools for this task is Pydantic, a Python library for data validation based on Python type annotations.
Pydantic makes it easy to define a schema and automatically validate incoming data. Let’s define a simple schema for our invoice use case:
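A minimal sketch, mirroring the illustrative invoice fields used above:

```python
from datetime import date
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice identifier")
    invoice_date: date = Field(description="Date the invoice was issued")
    vendor_name: str = Field(description="Name of the issuing vendor")
    total_amount: float = Field(description="Total amount due")
    currency: str = Field(default="EUR", description="ISO currency code")
```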
You can then validate your LLM output like this:
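For example, assuming the raw model output arrives as a JSON string:

```python
from pydantic import ValidationError

llm_output = '{"invoice_number": "INV-0042", "invoice_date": "2024-03-01", "vendor_name": "Acme Corp", "total_amount": 1250.0}'

try:
    invoice = Invoice.model_validate_json(llm_output)  # parse and validate in one step (Pydantic v2)
    print(invoice.total_amount)  # typed as float
except ValidationError as err:
    # Malformed or incomplete output - log it, retry the call, or fall back to a repair step.
    print(err.errors())
```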
You can even couple your LLM call directly with the Pydantic model using libraries like instructor, which acts as a wrapper that combines your LLM call with Pydantic validation.
Here’s an example using `instructor` with Anthropic via AWS Bedrock:
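A sketch, assuming instructor's `from_anthropic` wrapper accepts the `AnthropicBedrock` client from the anthropic SDK (check the instructor documentation for the exact integration with your model):

```python
import instructor
from anthropic import AnthropicBedrock

# Wrap the Bedrock-backed Anthropic client so responses are parsed into Pydantic models.
client = instructor.from_anthropic(AnthropicBedrock())

invoice = client.messages.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    max_tokens=1024,
    max_retries=3,  # instructor re-prompts the model if validation fails
    messages=[{"role": "user", "content": f"Extract the invoice data:\n\n{document_text}"}],
    response_model=Invoice,  # the Pydantic model defined above
)

# The result is already a validated Invoice instance with typed fields.
print(invoice.invoice_number)
print(invoice.total_amount)
```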
You can then directly access validated fields of your `Invoice` object, as shown in the final lines of the sketch above. This pattern not only enforces schema compliance but also gives you immediate access to strongly typed fields in your response.
Instructor will handle several key steps: converting your Pydantic model to a schema the LLM can understand, formatting the prompt appropriately for the provider, validating the LLM's response against your model, retrying automatically if validation fails, and returning a properly typed Pydantic object.
Instructor gives you fine-grained control over the retry strategy: note how we set `max_retries` in the request to define the number of retries. It also lets you catch retry exceptions, or use an additional library called Tenacity to define back-offs or other retry logic.
Note: The example shows how to use instructor with Anthropic models via AWS Bedrock, but you could similarly use it with other models like GPT-3.5, GPT-4, and open-source models such as Mistral.
Similarly, you could use your Pydantic model with frameworks like LangChain by using the `.with_structured_output()` method:
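A sketch using the `langchain-aws` integration (class and parameter names may vary with your LangChain version, and the model ID is illustrative):

```python
from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")  # illustrative model ID

# Bind the Pydantic schema to the model; results come back as Invoice instances.
structured_llm = llm.with_structured_output(Invoice)
invoice = structured_llm.invoke(f"Extract the invoice data:\n\n{document_text}")
```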
In production systems, validation is essential - either as a fallback for prompt-based approaches or as a reinforcement for function calling.
Note: If you are building an agentic system, check out PydanticAI, a Python agent framework.
Constrained decoding (or constrained sampling) guides the LLM's output by limiting (constraining) which tokens it can choose from, restricting the LLM's next-token predictions to only those that will maintain the required structure. Check out this piece (section “Level 3 - Structured Output”) by my colleague Stephen Hibbert to see how it works in action. This technique requires access to the model's complete next-token probability distribution, making it viable only for locally-run models unless directly supported by the provider. OpenAI has written about how they implemented constrained decoding within their Structured Outputs feature.
Another, similar approach I found is jsonformer, a Python library that wraps local Hugging Face models for structured decoding of a subset of the JSON Schema. Since many tokens in structured output are fixed and predictable (think of `{`, `"`, `]`), jsonformer fills in the fixed tokens during the generation process, so the model only needs to generate the content tokens.
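A sketch of typical jsonformer usage with a local Hugging Face model (model name and schema are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_name = "databricks/dolly-v2-3b"  # illustrative local model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

json_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

# Jsonformer emits the fixed structural tokens itself and only samples the value tokens.
jsonformer = Jsonformer(model, tokenizer, json_schema, "Extract the invoice data: ...")
invoice_data = jsonformer()  # returns a Python dict matching the schema
```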
In this post, we explored three approaches for generating structured output from LLMs:
- Prompt Engineering provides a simple, lightweight solution ideal for early prototyping and basic applications.
- Function Calling/Tool Use offers a more robust approach by enforcing schema specifications through API definitions.
- Output Parsers and Validators like Pydantic add an extra layer of reliability through schema validation and type checking.
The choice between these methods depends on your specific needs - from quick prototypes to production-grade systems. For critical applications, combining approaches (like function calling with Pydantic validation) often provides the most reliable solution. Importantly, good prompt engineering remains fundamental across all approaches - even when using output parsing and validation, you still need clear prompts instructing the model to generate valid JSON matching your schema. The key advantage of layering these techniques is that tools like Pydantic provide detailed validation errors, while still relying on good prompts to help the LLM generate the right format in the first place.
While this post showcases what I consider the most important approaches, many other tools and libraries exist. If you are looking for a more managed approach to getting structured output, check out services like Amazon Bedrock Data Automation, which combine traditional OCR with LLM-based processing.
Regardless of the chosen method, structured output is essential for successfully integrating LLMs into larger applications and workflows. Are you using other great libraries or frameworks to enforce structure in LLM outputs? Let me know in the comments!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.