In the Land of the Sizing, the One-Partition Kafka Topic is King
Deconstructing the mental model around partitions in Apache Kafka
Published Aug 22, 2022
Last Modified Mar 21, 2024
Every technology has that key concept that people struggle to understand. With databases, the struggle usually happens when you have to decide which join clause to use for fetching data from multiple tables. Which one is faster? What about consistency? What will concurrency look like if I pick this one versus the other? It is often a hard choice. Containers are another great example. Implementing persistence with containers is troublesome because each workload has its own set of requirements, and there is no silver bullet. For example, you may have a workload that requires each container to store 20% of the dataset locally, whereas the other 80% should go straight to a shared filesystem that is mounted on every container instance. But this design wouldn't suit a microservices-based architecture, where each service should have its own data store.
Just like any other technology, Apache Kafka also has a key concept that people struggle to understand, which is partitions. Partitions are tricky because they dictate pretty much everything about how Kafka works, and picking the right number is not simple. If you go to places like Stack Overflow and Hacker News, you will see developers providing objective answers to questions like how to install Kafka on Kubernetes, how to back up Kafka data on Amazon S3, and how to implement a Kafka consumer in Java. But when someone asks how many partitions to set for their Kafka topics, oh boy, this is where you see those long threads full of subjective opinions that more often than not don't provide a concrete answer. And even when they do, it is a total guessing game.
Why is this important, you may ask? Well, sizing Kafka partitions wrongly affects many aspects of the system, such as consistency, parallelism, and durability. Worse, it may also affect how much load Kafka can handle. The decision about how many partitions to set for a topic is often left to Ops teams, because we tend to see it as purely an infrastructure matter. In reality, it's an architectural design decision that affects even the amount of code you write. In this blog post series, I will share everything you need to build more confidence about how to size Kafka partitions correctly, and to spot a poor decision when you see one. This first part will peel back the concept of partitions and highlight their role in Kafka.
In 2022, I also presented on this topic at the Strange Loop conference in St. Louis, MO. If you fancy watching the recording of my session, check out the video below.
Going to the doctor's office for your annual check-up is one of those annoying situations in life. You go because you need to, not because you want to. Once you are there, you promptly do everything you are told and answer every question your physician asks, with that inner feeling that you want to hear that everything is okay. The reality, however, is rather disappointing. There are always things to discuss, even more so as you get older. Though your physician may try their best to explain the nuts and bolts of your situation, you don't really care about the nitty-gritty details. You care about concrete things, such as whether you can do that thing you love again.
But what does all of this have to do with Kafka? Well, explaining what partitions are is like a physician explaining a condition to their patient. Most people don't really care about the nitty-gritty details. They just want to hear that everything is okay, and to learn about concrete things such as whether the cluster can handle more load. Controversially, there is no better way to assess a Kafka cluster than looking into things at the partition level. Partitions are the heart of everything Kafka does, and truly understanding them is your ticket to mastering Kafka and realizing why things behave the way they do.
For this reason, I will dive into what partitions are and the role they play for the Kafka cluster and its clients. To get things started, I encourage you to set aside what most people say about partitions being just Kafka's unit-of-parallelism. I mean, it's not entirely wrong, but this definition is reductive and incomplete. Instead, let's work with the following definition:
I invite you to dive deep into each one of them in the following parts of this series. Use the navigation link below to access the next part.
Continue to Apache Kafka Partitions as a Unit of Parallelism