Understanding Apache Iceberg on AWS

We'll dive deep into the world of Apache Iceberg, the game-changing table format revolutionising data lakes on AWS. Discover how Iceberg empowers you to build reliable, scalable, and high-performance data lakes.

Tony Mullen
Amazon Employee
Published Sep 5, 2024
The show discusses Apache Iceberg, a table format and library that optimizes data storage and querying in data lakes. Iceberg allows for efficient updates, deletes, and schema changes, addressing common pain points in traditional data lake architectures. It integrates with popular open-source engines like Spark and Flink, as well as native AWS technologies. Iceberg provides SQL-based operations for data modification and is ACID-compliant, so the table can be accessed concurrently.
Iceberg's key features include metadata management, snapshot tracking, and partition optimization. The metadata layer stores the table's version history, allowing for efficient time travel and rollback. Iceberg's "hidden partitioning" automatically translates filters on the data into partition pruning, so a query scans only the data files that can contain matching rows. Iceberg also provides commands to compact small files and merge delete files, optimizing storage and query performance.
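To make the idea of hidden partitioning more concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's API: the `DataFile` class, the `day_transform` and `prune_files` functions, and the file paths are all hypothetical names made up for illustration. It assumes a table partitioned by the day of an event timestamp, and shows how a filter on the timestamp itself can be turned into a filter on the derived partition value, so that only some files need to be scanned.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    partition_day: date  # partition value derived from the event timestamp


def day_transform(ts: datetime) -> date:
    """Derive the partition value from a timestamp (a day-granularity transform)."""
    return ts.date()


def prune_files(files, ts_from: datetime, ts_to: datetime):
    """Keep only files whose partition day can contain rows in [ts_from, ts_to].

    The caller filters on the timestamp; the partition filter is derived here,
    which is the 'hidden' part of this toy model.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f for f in files if lo <= f.partition_day <= hi]


# Illustrative metadata: three files, one per day (paths are made up).
files = [
    DataFile("data/a.parquet", date(2024, 9, 3)),
    DataFile("data/b.parquet", date(2024, 9, 4)),
    DataFile("data/c.parquet", date(2024, 9, 5)),
]

# A query for events on 4 September between 08:00 and 18:00.
selected = prune_files(
    files,
    datetime(2024, 9, 4, 8, 0),
    datetime(2024, 9, 4, 18, 0),
)
print([f.path for f in selected])  # ['data/b.parquet']
```

In this sketch the query never mentions the partition column, yet two of the three files are skipped. A real engine would still apply the original timestamp filter to the rows of the remaining file.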
The presenters discuss Iceberg's active open-source community, with contributions from various companies. They highlight resources for learning more, including an upcoming Iceberg roadshow event in London, customer success stories, and detailed technical documentation. Viewers are encouraged to reach out to the presenters on LinkedIn for further assistance. The session wraps up by previewing the next "Zero ETL" show in the series.
Highlights:
  • Iceberg is a table format and library that optimizes data lakes
  • Provides SQL-based updates, deletes, and schema changes
  • Integrates with open-source engines and AWS technologies
  • Tracks table version history for efficient time travel and rollback
  • "Hidden partitioning" enables selective data scanning
  • Provides commands to compact files and merge deletes
  • Has an active open-source community with resources available
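The version-history highlight above can be pictured with a small toy model, again in plain Python rather than Iceberg's own API. The `ToyTable` class and its `commit`, `files`, and `rollback` methods are hypothetical names invented for this sketch. It assumes that each commit records a new immutable snapshot listing the table's data files, which is enough to illustrate time travel (reading an older snapshot) and rollback (making an older snapshot current again).

```python
class ToyTable:
    """Toy model of a table whose metadata keeps every snapshot (not Iceberg's API)."""

    def __init__(self):
        self._snapshots = {}   # snapshot id -> frozenset of data file paths
        self._current = None   # id of the current snapshot, None for an empty table
        self._next_id = 1

    def commit(self, add=(), remove=()):
        """Record a new snapshot derived from the current one and return its id."""
        base = self._snapshots.get(self._current, frozenset())
        new_files = (base - frozenset(remove)) | frozenset(add)
        snapshot_id = self._next_id
        self._next_id += 1
        self._snapshots[snapshot_id] = new_files
        self._current = snapshot_id
        return snapshot_id

    def files(self, snapshot_id=None):
        """List data files as of a snapshot (time travel); default is the current one."""
        sid = self._current if snapshot_id is None else snapshot_id
        return sorted(self._snapshots[sid])

    def rollback(self, snapshot_id):
        """Make an earlier snapshot current again; no snapshot is deleted."""
        if snapshot_id not in self._snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self._current = snapshot_id


table = ToyTable()
s1 = table.commit(add=["a.parquet"])
s2 = table.commit(add=["b.parquet"])
s3 = table.commit(remove=["a.parquet"])

print(table.files())      # ['b.parquet']
print(table.files(s1))    # ['a.parquet']  (time travel to the first snapshot)

table.rollback(s2)
print(table.files())      # ['a.parquet', 'b.parquet']
```

Because old snapshots are kept rather than overwritten, reading an earlier version or rolling back only changes which snapshot is treated as current in this model; no data files are rewritten.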
Check out the recording here:

Hosts of the show 🎤

Tony Mullen - Senior RDS Specialist Solutions Architect @ AWS

Guests 🎤

Carlos Rodrigues - Specialist Solution Architect, Data Analytics | AI/ML @ AWS
Angelos Chionis - Data, Analytics and AI Lead @ AWS

Links from today's episode

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
