Mastering the art of Data Enrichment with Apache Flink | S02 EP41 | Lets Talk About Data
In this episode we discuss enrichment patterns for streaming data, and how to implement them using Apache Flink. We then cover patterns in scenarios where reference data is static, available through external APIs, or available as a change data stream. We also dive into internal details about Flink state and how it stores reference data.
Prasad Matkar
Amazon Employee
Published Nov 7, 2024
In this Twitch show, guests Subham and Luis share their expertise on data enrichment patterns using Apache Flink. They discuss scenarios where reference data is static, fetched from APIs, or available as a change data stream. The discussion covers the advantages of stateful stream processing with Flink and techniques for handling late or out-of-order events. Subham and Luis then demonstrate code examples and architectural patterns for enriching streaming data efficiently, covering topics such as preloading reference data, leveraging Flink state, async I/O, caching, and handling rapidly changing reference data. They also touch on the scalability and auto-scaling capabilities of AWS's managed Flink service.
Key Highlights:
- Understanding stream processing and the need for data enrichment
- Preloading static reference data into Flink operator memory for low-latency enrichment
- Leveraging Flink state for scalable reference data storage when data is large
- Asynchronous API calls with Flink for efficient enrichment without busy waiting
- Implementing a local cache with Flink state for frequently changing reference data
- Handling late events by enriching with historically accurate reference data
- Comparing sync, async, and cached enrichment patterns in terms of performance
- Auto-scaling capabilities of AWS's managed Flink service based on CPU or custom metrics
- Enriching with rapidly changing reference data using Change Data Capture (CDC)
- Exploring code examples and demos for various enrichment patterns
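To make the cached, asynchronous enrichment idea from the highlights more concrete, here is a minimal sketch in plain Python. It is not Flink code and not the code shown in the episode: the class `CachedAsyncEnricher`, the stand-in `fake_reference_api`, the `reference` output field, and the TTL values are all hypothetical names chosen for illustration. It only shows the general shape of the pattern: look up reference data without blocking, and reuse a cached answer until it expires.

```python
import asyncio
import time


class CachedAsyncEnricher:
    """Enrich events through an async lookup, caching each result for ttl_seconds.

    Illustrative only: in a Flink job the cache would typically live in
    operator memory or Flink state, and the lookup would run through
    Flink's async I/O support.
    """

    def __init__(self, lookup, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = lookup      # async function: key -> reference data
        self._ttl = ttl_seconds    # how long a cached entry stays valid
        self._clock = clock        # injectable clock, handy for testing
        self._cache = {}           # key -> (expiry_time, reference data)

    async def enrich(self, event, key_field):
        key = event[key_field]
        now = self._clock()
        entry = self._cache.get(key)
        if entry is None or entry[0] <= now:
            # Cache miss or expired entry: call the (hypothetical) reference API.
            value = await self._lookup(key)
            self._cache[key] = (now + self._ttl, value)
        else:
            # Fresh entry: reuse it and skip the external call.
            value = entry[1]
        # Return a new dict so the original event is left untouched.
        return {**event, "reference": value}


async def fake_reference_api(key):
    """Hypothetical stand-in for an external reference-data API."""
    await asyncio.sleep(0.01)  # simulate network latency without blocking
    return {"customer_id": key, "tier": "gold" if key.endswith("1") else "standard"}


async def main():
    enricher = CachedAsyncEnricher(fake_reference_api, ttl_seconds=30.0)
    events = [
        {"customer_id": "c1", "amount": 10},
        {"customer_id": "c2", "amount": 5},
    ]
    # The lookups for different events overlap instead of running one by one.
    enriched = await asyncio.gather(
        *(enricher.enrich(e, "customer_id") for e in events)
    )
    for row in enriched:
        print(row)


if __name__ == "__main__":
    asyncio.run(main())
```

A shorter TTL keeps the enrichment closer to the current reference data at the cost of more external calls, while a longer TTL does the opposite. Note that in this simple sketch, two concurrent requests for the same uncached key will both call the lookup.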
Check out the recording here:
Prasad Matkar - Database Specialist SA @ AWS
Subham Rakshit - Senior Analytics Solutions Architect @ AWS
Luis Morales - Senior Solutions Architect @ AWS
- Amazon Managed Service for Apache Flink - https://aws.amazon.com/managed-service-apache-flink/
- Blog - Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink - https://aws.amazon.com/blogs/big-data/common-streaming-data-enrichment-patterns-in-amazon-kinesis-data-analytics-for-apache-flink/
- Blog - Implement Apache Flink real-time data enrichment patterns - https://aws.amazon.com/blogs/big-data/implement-apache-flink-real-time-data-enrichment-patterns/
- Blog - Perform Amazon Kinesis load testing with Locust - https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/
You can check out our past shows on our community page - https://community.aws/livestreams/lets-talk-about-data
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.