Supercharge Your Tabular Data with Pandas and Generative AI
What if you could add the power of Generative AI, Amazon Bedrock, and Amazon Nova foundation models to your data workflows?
Laith Al-Saadoon
Amazon Employee
Published Dec 16, 2024
Last Modified Dec 17, 2024
Pandas is the go-to Python library for data analysis, enabling you to slice, dice, and transform your data at scale. But what if you could go a step further and add the power of Generative AI to your data workflows?
A few people have asked me how I sling tabular data with Pandas and generative AI, so I thought I'd write a post about it.
In this post, I'll walk through how to combine Pandas with a generative AI model served through Amazon Bedrock and orchestrated by LangChain AWS. You'll start by installing the necessary packages; then I'll show you how to create a sample dataset, do some feature engineering, and finally apply generative AI transformations to your `DataFrame`. By the end, you'll be able to supercharge your data analysis with machine-generated insights, summaries, and even structured output that extracts new columns from rich text columns.

To follow along, you'll need:

- Python 3.7+
- AWS credentials with proper permissions to access Amazon Bedrock (if you're running locally)
- Jupyter Notebook (optional, but recommended for interactivity)
- Familiarity with Pandas (if you're not, I recommend reading the Pandas documentation)
- Enabled access to the foundation model (Amazon Nova Lite) in your AWS account. See Amazon Bedrock for more information.
```
%pip install langchain-aws pandas langchain-core boto3 pydantic
```
We're going to create a small synthetic dataset of 10 rows, using the `numpy` library to generate random data for each column. Don't worry too much about the code for generating the dataset; we'll use the data later to apply generative AI transformations. At this step, you would otherwise use `pd.read_csv()` or another Pandas method to read your own structured data.

In brief, we're creating a dataset with the following columns:
- `bedrooms`: randomly generated number of bedrooms between 1 and 5
- `bathrooms`: randomly generated number of bathrooms between 1 and 3
- `sqft`: randomly generated square footage between 500 and 3500
- `location`: randomly chosen from a list of cities
- `view`: randomly chosen from a list of views
- `style`: randomly chosen from a list of styles
- `year_built`: randomly generated year between 1900 and 2024
- `condition`: randomly chosen from a list of conditions
- `parking`: randomly chosen from a list of parking options
- `pool`: randomly chosen from a list of pool options
- `price`: loosely dependent on the other features of the house
Your dataset will look different if you change the random seed.
The price of each property is calculated using the following formula:
`price = base_price × location_multiplier × view_multiplier × condition_multiplier`
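The original generation code isn't shown here, so below is a minimal sketch that matches the columns and ranges described above. The city names, view/condition categories, and multiplier values are illustrative assumptions, not the post's exact originals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10

# Columns and value ranges follow the list above; category values are assumptions.
df = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "sqft": rng.integers(500, 3501, n),
    "location": rng.choice(["Seattle", "Austin", "Denver", "Miami"], n),
    "view": rng.choice(["city", "mountain", "water", "none"], n),
    "style": rng.choice(["modern", "craftsman", "colonial"], n),
    "year_built": rng.integers(1900, 2025, n),
    "condition": rng.choice(["poor", "fair", "good", "excellent"], n),
    "parking": rng.choice(["garage", "street", "none"], n),
    "pool": rng.choice(["yes", "no"], n),
})

# price = base_price × location_multiplier × view_multiplier × condition_multiplier
location_mult = {"Seattle": 1.4, "Austin": 1.1, "Denver": 1.0, "Miami": 1.2}
view_mult = {"city": 1.1, "mountain": 1.2, "water": 1.3, "none": 1.0}
condition_mult = {"poor": 0.8, "fair": 0.9, "good": 1.0, "excellent": 1.2}

base_price = df["sqft"] * 200  # rough $/sqft baseline (an assumption)
df["price"] = (
    base_price
    * df["location"].map(location_mult)
    * df["view"].map(view_mult)
    * df["condition"].map(condition_mult)
).round(-3)

print(df.head())
```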
Let's add a derived feature to the dataset: a new column called `price_category` that automatically categorizes the house price into three buckets: `Low`, `Medium`, and `High`.

We'll use `ChatBedrockConverse` from LangChain AWS to call Amazon Bedrock, with Amazon Nova Lite as the LLM.
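The `price_category` derivation itself needs no LLM. Here's a sketch using quantile-based buckets via `pd.qcut`; the exact cut rule is an assumption, since the post doesn't show its code, and the prices are sample values.

```python
import pandas as pd

# Sample prices; in the walkthrough this column comes from the synthetic dataset.
df = pd.DataFrame({"price": [250_000, 480_000, 610_000, 900_000, 1_200_000, 320_000]})

# Bucket prices into three quantile-based categories (assumed rule).
df["price_category"] = pd.qcut(df["price"], q=3, labels=["Low", "Medium", "High"])
print(df)
```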
Let’s say we want to use the model to generate human-readable property descriptions based on the features. We can define a function that takes a single DataFrame row, crafts a prompt, sends it to the LLM, and returns a generated description. We'll also use our global statistics to provide context to the LLM about overall trends in the dataset.
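A sketch of what that function might look like. The prompt wording, the `global_stats` fields, and the model ID (`us.amazon.nova-lite-v1:0`) are assumptions; the LLM call is kept in its own function, with a lazy import, so the prompt builder runs without AWS access.

```python
import pandas as pd

def build_description_prompt(row: pd.Series, global_stats: dict) -> str:
    # Craft a prompt from one row plus dataset-wide context.
    return (
        "Write a short, engaging real-estate listing description.\n"
        f"Property: {row.to_dict()}\n"
        f"Dataset context (overall averages): {global_stats}\n"
        "Keep it under 60 words."
    )

def generate_description(row: pd.Series, global_stats: dict) -> str:
    # Imported lazily so the prompt builder is usable without langchain-aws.
    from langchain_aws import ChatBedrockConverse

    llm = ChatBedrockConverse(model="us.amazon.nova-lite-v1:0")  # model ID is an assumption
    response = llm.invoke(build_description_prompt(row, global_stats))
    return response.content

# Exercise the prompt builder with a sample row (no AWS call).
row = pd.Series({"bedrooms": 3, "sqft": 1800, "location": "Denver", "price": 540_000})
prompt = build_description_prompt(row, {"mean_price": 612_000})
print(prompt)
```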
We'll use Pandas `apply` to run the function on each row of the `DataFrame`.

Let's check out the result!
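The row-wise pattern looks like this. A stand-in function replaces the Bedrock call so the sketch runs offline; swap in your LLM-backed generator in practice.

```python
import pandas as pd

df = pd.DataFrame({
    "bedrooms": [2, 4],
    "location": ["Austin", "Seattle"],
    "price": [350_000, 820_000],
})

def describe(row: pd.Series) -> str:
    # Stub: a real version would call the LLM here.
    return f"{row['bedrooms']}-bed home in {row['location']} listed at ${row['price']:,}"

# axis=1 passes each row as a Series to the function.
df["description"] = df.apply(describe, axis=1)
print(df["description"].tolist())
```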
What if you want to generate insights for groups of data? For example, you might want to summarize the dataset by location. You can use the `groupby` method to group the data by location and then `apply` a generative AI transformation to each group.

First, we'll compute some global statistics about the dataset to provide context to the LLM. This will help the model produce richer summaries that consider how each group compares to the entire dataset.
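A sketch of the global-statistics step. The stat names and sample values here are assumptions; pick whatever dataset-wide context helps your prompts.

```python
import pandas as pd

# Sample data standing in for the synthetic dataset.
df = pd.DataFrame({
    "location": ["Seattle", "Seattle", "Austin", "Austin", "Denver"],
    "price": [800_000, 900_000, 400_000, 500_000, 600_000],
    "sqft": [2000, 2400, 1500, 1800, 2100],
})

# Dataset-wide context to embed in prompts.
global_stats = {
    "mean_price": df["price"].mean(),
    "median_price": df["price"].median(),
    "mean_sqft": df["sqft"].mean(),
    "num_properties": len(df),
}
print(global_stats)
```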
Now let's create a function that takes a group of data and generates a summary of it, then use `groupby` and `apply` to run the generative AI transformation for each location.

Drum roll...
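A sketch of that per-group function. The prompt text and model ID are assumptions; the LLM call is again split out so the prompt builder runs without AWS access.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["Seattle", "Seattle", "Austin"],
    "price": [800_000, 900_000, 450_000],
    "sqft": [2000, 2400, 1600],
})

def build_group_prompt(location: str, group: pd.DataFrame, global_stats: dict) -> str:
    # Combine per-group stats with dataset-wide context.
    return (
        f"Summarize the housing market in {location}.\n"
        f"Group stats: mean price {group['price'].mean():,.0f}, "
        f"{len(group)} listings.\n"
        f"Whole-dataset context: {global_stats}"
    )

def summarize_group(location: str, group: pd.DataFrame, global_stats: dict) -> str:
    from langchain_aws import ChatBedrockConverse
    llm = ChatBedrockConverse(model="us.amazon.nova-lite-v1:0")  # model ID is an assumption
    return llm.invoke(build_group_prompt(location, group, global_stats)).content

# Build one prompt per location (no AWS call).
global_stats = {"mean_price": df["price"].mean()}
prompts = {
    loc: build_group_prompt(loc, grp, global_stats)
    for loc, grp in df.groupby("location")
}
print(prompts["Seattle"])
```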
Let's try another example. What if we want to generate a summary of the dataset by price category?
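The pattern is the same, just keyed on the derived category instead of location. A self-contained sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "price_category": ["Low", "High", "Low", "Medium"],
    "price": [250_000, 950_000, 300_000, 550_000],
})

# Per-category stats you'd feed into the group-summary prompt.
for category, group in df.groupby("price_category"):
    print(f"{category}: {len(group)} listings, mean ${group['price'].mean():,.0f}")
```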
Let's say we have a column with a long-form text description of a property and we want to extract structured data from it. We can use a generative AI model to do this. Conceptually, we're going in the opposite direction of the previous examples: instead of generating a summary or description, we'll generate structured data from a text column.
Let's start by reviewing the description column, and pretend it's the only column we have.
We can use a generative AI model to extract structured data from the text column. We'll use `ChatBedrockConverse` together with Pydantic to do this.

And check out the results!
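A sketch of the extraction setup. `with_structured_output` is LangChain's mechanism for schema-constrained responses; the field set, prompt, and model ID are assumptions. The schema can be exercised without calling the model.

```python
from pydantic import BaseModel, Field

class PropertyFeatures(BaseModel):
    # Fields mirror the dataset columns we want to recover from free text.
    bedrooms: int = Field(description="Number of bedrooms")
    bathrooms: int = Field(description="Number of bathrooms")
    sqft: int = Field(description="Square footage")
    location: str = Field(description="City the property is in")

def extract_features(description: str) -> PropertyFeatures:
    # Imported lazily so the schema is usable without langchain-aws.
    from langchain_aws import ChatBedrockConverse

    llm = ChatBedrockConverse(model="us.amazon.nova-lite-v1:0")  # model ID is an assumption
    structured_llm = llm.with_structured_output(PropertyFeatures)
    return structured_llm.invoke(
        f"Extract the property features from this listing:\n{description}"
    )

# Validate the schema locally (no AWS call).
sample = PropertyFeatures(bedrooms=3, bathrooms=2, sqft=1800, location="Denver")
print(sample.model_dump())
```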
In this post, you've seen how to use Pandas and generative AI with Amazon Bedrock to supercharge your data analysis: generating summaries, descriptions, and structured data from a dataset, and using `groupby` to produce insights for groups of data.

I hope you found this post useful for your own projects. If you have any questions, please feel free to reach out to me on LinkedIn.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.