Real-Time Sentiment Analysis Using Apache Kafka and SageMaker Jupyter Notebooks

Ever wondered how to enrich your real-time streaming data with machine learning-driven analysis? Whether it is event or anomaly detection, automated alerts and notifications, network or sentiment analysis, you can leverage the power of machine learning algorithms to gain valuable insights and make informed decisions. By integrating machine learning into your real-time streaming data pipeline, you open up a world of possibilities for all kinds of data analysis.

In this tutorial, we'll show you an example of using a model run by SageMaker and usage of SageMaker Jupyter Notebooks to enrich the streaming data from an Apache Kafka topic. In particular, we’ll be working with Mastodon social network records and as the data flows through Kafka, we'll be enriching it with positive and negative sentiment information in real-time. The outcome will be stored in two separate topics: one for successful enrichment and another one for any issues that happened during analysis.

What you will learn

How to establish connection between an AWS SageMaker Jupyter Notebook and an Apache Kafka cluster.
How to consume data from an Apache Kafka topic and enhance it using an AI model that runs on SageMaker.
How to produce data back to Apache Kafka from the SageMaker Jupyter Notebook.

Prerequisites

About
✅ AWS Level	Intermediate - 200
⏱ Time to complete	20 mins
🧩 Prerequisites	- AWS account and access to SageMaker Studio. Follow this link to set up your AWS account if you don’t have any. - Apache Kafka. We’ll be using Aiven for Apache Kafka cluster, which you can create at no cost for this tutorial as part of a free trial. Follow this link to register for Aiven for Apache Kafka. - Data for sentiment analysis. Follow steps described in this article to stream mastodon data to Apache Kafka topic.
💻 Code Sample	Code sample used in tutorial on GitHub
💰 Cost to complete	Free when using the AWS Free Tier and Aiven for Apache Kafka
🔨 Services used	- SageMaker
⏰ Last Updated	2023-12-11

Step 1 - Set up an Apache Kafka cluster

Secure connection between AWS SageMaker and Apache Kafka

To access an Apache Kafka cluster from a Sagemaker Jupyter Notebook and establish a TLS connection you need to have this data from your cluster:

service URI of your Apache Kafka cluster
files containing access key, access certificate and CA certificate of your Apache Kafka services.

All of those you can get from Aiven for Apache Kafka service landing page.

Get credentials to access Aiven for Apache Kafka

Keep this data ready, we'll use it in the steps below.

Apache Kafka topics for input and output data

We'll need three topics in your Apache Kafka cluster:

social_media_messages - this is the topic that contains data to be analised.
enriched_data - this topic will be filled in with the analysis result
processing_errors - this topic will be collecting information on the errors during data analysis.

When using Aiven for Apache Kafka you can do this via console interface:

Aiven platform interface to add a new topic

Once done you should see three topics ready:

Input data generation

We recommend you use JavaScript-based data generator to get a stream of social media data. However, you can also stream any other data that you want to perform analysis on, just make sure to adjust the property names in the code.

Step 2 - Set up Sagemaker and a Jupyter Notebook

To access to SageMaker studio you'll need to create a domain first:

Select Set up for single user (Quick setup) and wait till a new domain is established. Once all is ready click to open the studio:

Once you'll land into Amazon SageMaker Studio, select JumpStart:

Select JumpStart in Amazon SageMaker Studio

Here you can search for one of already existing and trained models. We want a model that is trained to do text classifications. Different models have different advantages and disadvantages, select a model taking into account type of text you want to analyse and the length of text snippet. Social media messages from Mastodon are quite short, so we'll go for a model "DistilBERT Base Multilingual Cased". You can find it by using the search bar. Select the one for text classification:

Click to deploy the model and wait till it has been prepared:

When the model is deployed, open the demo notebook. You'll be prompted to set up notebook environment, you can agree to suggested values:

In this demo notebook you can make test requests for text classification. We'll be modifying the existing instructions as the part of the next step.

Step 3 - Write code

Now it is time to write the code! Feel free to clean the existing cells, we'll replace the example code with our own logic.

If you want to skip explanations and just to copy the result , load the final notebook code (don't forget to add the information about your Kafka instance!). Alternatively, follow the more detailed step-by-step instructions below.

The first step is to install Apache Kafka python library. For this create a new cell and add this instruction:

Next we'll import dependencies that we need:

Now let's define a function that we'll later use to run model inference and get predictions based in provided text:

Next, take files you got from your Apache Kafka instance and load them into the folder next to the notebook file.

Once that is done, set up Apache Kafka connection properties:

Create Apache Kafka consumer and logic to process consumed messages:

Create Apache Kafka producer and logic to process data that we want to send to the topics:

Finally, add a cell to start processing records from Apache Kafka input topic and enriching them:

Great! This is all the code that we need.

Step 4 - Running and testing

Run notebook cells, see output for processed messages. Each record should now contain additional properties on probabilities and a prediction done by model.

To see that data was successfully sent to Apache Kafka topic enriched_data go to "Topic messages", select JSON format and click to fetch the messages:

Now you can go through all the processed records:

You can also check the topic with errors that occurred during processing:

Clean up

Don't forget to clean up the resources you've used.
For Sagemaker you need to clean up created Endpoint, Endpoint configuration, running Model and Notebook instances. Check https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html for more details.

In Aiven for Apache Kafka select "Delete service" in the menu:

Conclusion

And that's it! We've just created a pipeline to process and analyse sentiment behind social media messages in real time. We used Apache Kafka python client library to create a producer and a consumer for Apache Kafka topic and one of the available multi-language models for text classifications.

If you enjoyed this tutorial, found any issues, or have feedback for us, please send it our way!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.

Site Terms, Privacy, and more.