
Mastering the art of Data Enrichment with Apache Flink | S02 EP41 | Lets Talk About Data
In this episode we discuss enrichment patterns for streaming data and how to implement them with Apache Flink. We cover patterns for scenarios where the reference data is static, available through external APIs, or available as a change data stream. We also look at how Flink state works internally and how it stores reference data.
- Understanding stream processing and the need for data enrichment
- Preloading static reference data into Flink operator memory for low-latency enrichment
- Leveraging Flink state for scalable reference data storage when data is large
- Asynchronous API calls with Flink for efficient enrichment without busy waiting
- Implementing a local cache with Flink state for frequently changing reference data
- Handling late events by enriching with historically accurate reference data
- Comparing sync, async, and cached enrichment patterns in terms of performance
- Auto-scaling capabilities of Amazon Managed Service for Apache Flink based on CPU or custom metrics
- Enriching with rapidly changing reference data using Change Data Capture (CDC)
- Exploring code examples and demos for various enrichment patterns
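
To give a flavor of the cached, asynchronous enrichment pattern from the list above, here is a minimal sketch in plain Java. It does not use the Flink API and is not the code shown in the episode: the class name `CachedAsyncEnricher`, the `lookup` function, and the `payload|reference` output format are all hypothetical, chosen only for illustration. In a real Flink job the asynchronous call would typically live inside Flink's Async I/O operator, and the cache could be held in operator memory or Flink state.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: enriches events with reference data fetched
// asynchronously, caching each lookup so repeated keys skip the remote call.
final class CachedAsyncEnricher {
    // Cache of in-flight or completed lookups, keyed by reference-data key.
    // Note: a failed lookup stays cached in this sketch; a real implementation
    // would evict failures and expire stale entries.
    private final Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    // Assumed stand-in for a non-blocking call to an external reference-data API.
    private final Function<String, CompletableFuture<String>> lookup;

    CachedAsyncEnricher(Function<String, CompletableFuture<String>> lookup) {
        this.lookup = lookup;
    }

    // Returns a future of the enriched event; the caller never blocks on the API.
    CompletableFuture<String> enrich(String key, String payload) {
        return cache.computeIfAbsent(key, lookup)
                    .thenApply(reference -> payload + "|" + reference);
    }
}
```

Because the cache stores the future rather than the resolved value, concurrent events for the same key share a single in-flight request instead of each triggering its own API call.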
Luis Morales - Senior Solutions Architect @ AWS
- Amazon Managed Service for Apache Flink - https://aws.amazon.com/managed-service-apache-flink/
- Blog - Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink - https://aws.amazon.com/blogs/big-data/common-streaming-data-enrichment-patterns-in-amazon-kinesis-data-analytics-for-apache-flink/
- Blog - Implement Apache Flink real-time data enrichment patterns - https://aws.amazon.com/blogs/big-data/implement-apache-flink-real-time-data-enrichment-patterns/
- Blog - Perform Amazon Kinesis load testing with Locust - https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.