Supercharge Your Tabular Data with Pandas and Generative AI

What if you could add the power of Generative AI, Amazon Bedrock, and Amazon Nova foundation models to your data workflows?

Laith Al-Saadoon
Amazon Employee
Published Dec 16, 2024
Last Modified Dec 17, 2024
Pandas is the go-to Python library for data analysis, enabling you to slice, dice, and transform your data at scale. But what if you could go a step further and add the power of Generative AI to your data workflows?
A few people have asked me how I sling tabular data, Pandas, and generative AI together, so I thought I'd write a post about it.
In this post, I'll walk through how to combine Pandas with a generative AI model served through Amazon Bedrock and orchestrated by LangChain AWS. You'll start by installing the necessary packages; then I'll show you how to create a sample dataset, do some feature engineering, and finally apply generative AI transformations to your DataFrame. By the end, you'll be able to supercharge your data analysis with machine-generated insights, summaries, and even structured output that extracts new columns from rich text columns.

Prerequisites

  • Python 3.7+
  • AWS Credentials with proper permissions to access Amazon Bedrock (if you’re running locally)
  • Jupyter Notebook (optional but recommended for interactivity)
  • Familiarity with Pandas (if you're not, I recommend reading the Pandas documentation)
  • Enabled access to the foundation model (Amazon Nova Lite) in your AWS account. See the Amazon Bedrock documentation for more information.

Step 1: Install the necessary packages

%pip install langchain-aws pandas langchain-core boto3 pydantic

Step 2: Set up a synthetic dataset

We're going to create a synthetic dataset with 10 rows and 11 columns, using the numpy library to generate random data for each column.
Don't worry too much about the dataset-generation code (a sketch follows the price formula below). We'll use the data later to apply generative AI transformations. In a real project, this is the step where you would instead use pd.read_csv() or another Pandas method to read your own structured data.
In brief, we're creating a dataset with the following columns:
  • bedrooms: Randomly generated number of bedrooms between 1 and 5
  • bathrooms: Randomly generated number of bathrooms between 1 and 3
  • sqft: Randomly generated square footage between 500 and 3500
  • location: Randomly generated location from a list of cities
  • view: Randomly generated view from a list of views
  • style: Randomly generated style from a list of styles
  • year_built: Randomly generated year built between 1900 and 2024
  • condition: Randomly generated condition from a list of conditions
  • parking: Randomly generated parking from a list of parking options
  • pool: Randomly generated pool from a list of pool options
  • price: Loosely dependent on the other features of the house
Your dataset will look different if you change the random seed.
The price of each property is calculated using the following formula:
price = base_price × location_multiplier × view_multiplier × condition_multiplier
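
Here's a minimal sketch of the generation code. The city, view, style, and condition lists, the multiplier values, and the $200-per-square-foot base price are illustrative assumptions, not the post's exact values:

import numpy as np
import pandas as pd

np.random.seed(42)  # change the seed and the dataset will look different
n = 10

df = pd.DataFrame({
    "bedrooms": np.random.randint(1, 6, n),
    "bathrooms": np.random.randint(1, 4, n),
    "sqft": np.random.randint(500, 3501, n),
    "location": np.random.choice(["Seattle", "Austin", "Denver", "Miami"], n),
    "view": np.random.choice(["City", "Mountain", "Water", "None"], n),
    "style": np.random.choice(["Craftsman", "Modern", "Colonial"], n),
    "year_built": np.random.randint(1900, 2025, n),
    "condition": np.random.choice(["Poor", "Fair", "Good", "Excellent"], n),
    "parking": np.random.choice(["Garage", "Street", "None"], n),
    "pool": np.random.choice(["Yes", "No"], n),
})

# Price follows the formula above: a sqft-based base price scaled by the
# location, view, and condition multipliers (specific values are assumptions).
location_mult = df["location"].map({"Seattle": 1.4, "Austin": 1.1, "Denver": 1.0, "Miami": 1.2})
view_mult = df["view"].map({"City": 1.1, "Mountain": 1.2, "Water": 1.3, "None": 1.0})
condition_mult = df["condition"].map({"Poor": 0.8, "Fair": 0.9, "Good": 1.0, "Excellent": 1.15})
df["price"] = (df["sqft"] * 200 * location_mult * view_mult * condition_mult).round(-3)

df.head()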

Step 3: Feature engineering

Let's add a derived feature to the dataset: a new column called price_category that automatically buckets each house price into one of three categories: Low, Medium, and High.
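
A minimal sketch using pd.qcut, which splits on quantiles so each bucket gets roughly a third of the rows (the original may use fixed price thresholds instead):

# qcut assigns each price to a quantile-based bucket
df["price_category"] = pd.qcut(df["price"], q=3, labels=["Low", "Medium", "High"])
df[["price", "price_category"]]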

Step 4: Initialize the LLM model

We'll use ChatBedrockConverse from LangChain AWS to call Amazon Bedrock and use Amazon Nova Lite as the LLM.
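
A minimal sketch of the setup. The model ID is an assumption; depending on your region you may need the cross-region inference profile (us.amazon.nova-lite-v1:0) rather than the base ID, so check the Bedrock console for what's enabled in your account:

from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    model="us.amazon.nova-lite-v1:0",  # assumed model ID; verify in your account
    temperature=0.3,
    max_tokens=512,
)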

Step 5: Generate a property description for each row

Let’s say we want to use the model to generate human-readable property descriptions based on the features. We can define a function that takes a single DataFrame row, crafts a prompt, sends it to the LLM, and returns a generated description. We'll also use global statistics to give the LLM context about overall trends in the dataset.
We'll use Pandas' apply to run the function on each row of the DataFrame.
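Here's a sketch of that function. The prompt wording is illustrative, and the _text helper is my own addition to normalize the response, since ChatBedrockConverse may return message content as a plain string or as a list of content blocks depending on the langchain-aws version:

from langchain_core.messages import AIMessage

def _text(msg: AIMessage) -> str:
    # Normalize .content, which may be a string or a list of content blocks.
    if isinstance(msg.content, str):
        return msg.content
    return "".join(block.get("text", "") for block in msg.content if isinstance(block, dict))

def generate_description(row: pd.Series) -> str:
    # Craft a prompt from this row's features plus dataset-wide context.
    prompt = (
        "Write a short, engaging listing description for this property:\n"
        f"{row.to_dict()}\n"
        f"For context, the average price across the dataset is ${df['price'].mean():,.0f} "
        f"and the average size is {df['sqft'].mean():,.0f} sqft."
    )
    return _text(llm.invoke(prompt))

# axis=1 passes each row to the function as a Series.
df["description"] = df.apply(generate_description, axis=1)
df[["location", "price", "description"]].head()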
Let's check out the result!

Step 6: Use groupby to generate insights for groups of data

What if you want to generate insights for groups of data? For example, you might want to generate a summary of the dataset by location. You can group the data with the groupby method and then use apply to run a generative AI transformation on each group.
We'll compute some global statistics about the dataset to provide context to the LLM. This will help our LLM produce richer summaries that consider how each group compares to the entire dataset.
Then we'll create a new function that takes a group of data and generates a summary of it.
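A sketch of that function, reusing the _text helper from Step 5; the specific statistics and prompt wording are illustrative:

# Dataset-wide statistics give the model context for each group.
overall_stats = {
    "avg_price": round(df["price"].mean()),
    "avg_sqft": round(df["sqft"].mean()),
    "n_listings": len(df),
}

def summarize_group(group: pd.DataFrame) -> str:
    prompt = (
        f"Summarize this group of property listings:\n{group.to_string()}\n\n"
        f"Dataset-wide statistics for comparison: {overall_stats}\n"
        "Note how the group compares to the dataset overall."
    )
    return _text(llm.invoke(prompt))

location_summaries = df.groupby("location").apply(summarize_group)
print(location_summaries)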
Drum roll...
Let's try another example. What if we want to generate a summary of the dataset by price category?
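The same summarize_group function works unchanged; only the grouping key differs:

# observed=True limits the groupby to categories that actually appear,
# since price_category is a categorical column.
category_summaries = df.groupby("price_category", observed=True).apply(summarize_group)
print(category_summaries)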

Bonus! Extract structured data from a long-form text column into new columns

Let's say we have a column with a long-form text description of a property and we want to extract structured data from it. We can use a generative AI model to do this. Conceptually, we will go the opposite direction of the previous examples. Instead of generating a summary or description, we will generate structured data from a text column.
Let's start by reviewing our description column and pretend that we are starting from only the description column.
We can use a generative AI model to extract structured data from the text column. We'll use ChatBedrockConverse's structured-output support together with Pydantic to do this.
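Here's a sketch using with_structured_output, which binds a Pydantic schema to the model via tool calling so each response comes back as a validated object. The PropertyFacts fields are illustrative assumptions:

from typing import Optional
from pydantic import BaseModel, Field

class PropertyFacts(BaseModel):
    """Fields to pull out of a free-text listing (an illustrative schema)."""
    bedrooms: Optional[int] = Field(None, description="Number of bedrooms")
    bathrooms: Optional[int] = Field(None, description="Number of bathrooms")
    has_pool: Optional[bool] = Field(None, description="Whether the property has a pool")
    style: Optional[str] = Field(None, description="Architectural style, if mentioned")

# with_structured_output returns PropertyFacts objects instead of raw text.
extractor = llm.with_structured_output(PropertyFacts)

def extract_facts(description: str) -> dict:
    result = extractor.invoke(f"Extract the property facts from this listing:\n{description}")
    return result.model_dump()

# Fan the extracted fields out into new columns next to the original text.
facts = df["description"].apply(extract_facts).apply(pd.Series)
df_extracted = pd.concat([df["description"], facts], axis=1)
df_extracted.head()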
And check out the results!

Conclusion

In this post, you've seen how to use Pandas and Generative AI with Amazon Bedrock to supercharge your data analysis. You've seen how to generate summaries, descriptions, and structured data from a dataset. You've also seen how to use groupby to generate insights for groups of data.
I hope you found this post useful for your own projects. If you have any questions, please feel free to reach out to me on LinkedIn.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
