Kafka - Basics you need to know

Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, and scalable data processing. It enables applications to publish, subscribe to, and process streams of events (or messages) in real time. Kafka stores these events in topics, which are partitioned and replicated for reliability.

Topics

A topic is a logical channel through which data is written and read in a Kafka cluster. Topics are partitioned, meaning the data is split across multiple ordered logs (partitions) to enable parallel processing and scalability. Each partition is ordered, and messages within it are assigned sequential offsets. Topics are also replicated for fault tolerance, ensuring data availability even if a broker fails. Producers write messages to a topic, while consumers read from it, either individually or as part of a consumer group for load balancing. Topics can also be configured with retention policies that control how long messages are stored, making them suitable for both real-time and historical data processing.
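
As an illustration, the sketch below creates a topic with explicit partition, replication, and retention settings. It uses the confluent-kafka Python client as one possible client; the broker address, topic name, and settings are placeholders, not values prescribed by this guide.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Hypothetical broker address for illustration.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# A topic with 6 partitions, replicated 3 times, retaining data for 7 days.
topic = NewTopic(
    "example-events",
    num_partitions=6,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

# create_topics() is asynchronous; each returned future resolves once the
# broker has acted on the request.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"Created topic {name}")
    except Exception as exc:
        print(f"Failed to create {name}: {exc}")
```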

Partitions

A partition is a smaller, ordered subset of a Kafka topic, designed to enable scalability and parallelism in data processing. Each partition is identified by a unique number and stores messages in the order they are produced, with each message assigned a sequential offset. Partitions allow a topic to scale horizontally across multiple brokers, distributing both storage and processing load. Consumers in a consumer group can read from partitions in parallel, ensuring efficient data processing. Partitions also support replication for fault tolerance, with one partition replica designated as the leader to handle all reads and writes, while others act as backups.
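
Partitioning is easiest to see from the producer side: records that share a key always hash to the same partition, which is what preserves per-key ordering. Below is a minimal sketch assuming the confluent-kafka Python client; the broker address, topic, and keys are hypothetical.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # The broker reports which partition and offset each record landed on.
    if err is None:
        print(f"key={msg.key().decode()} -> partition {msg.partition()} offset {msg.offset()}")

# Records with the same key go to the same partition, preserving per-key
# ordering; records without a key are spread across partitions.
for i in range(10):
    producer.produce(
        "example-events",
        key=f"user-{i % 3}",
        value=f"event-{i}",
        callback=on_delivery,
    )

producer.flush()
```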

Offsets

An offset in Kafka is a unique identifier assigned to each message within a topic partition. Offsets ensure messages are stored and retrieved in the exact order they were produced. Consumers use offsets to track their position in the stream, allowing them to resume processing from a specific point if needed. Offsets are specific to a partition and are not shared across partitions, meaning each partition has its own independent sequence. This mechanism provides precise control over message consumption and ensures reliable processing in distributed systems.
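
To make this concrete, the sketch below asks the broker for a group's last committed offset on one partition, then replays that partition from offset 0 by assigning it manually with an explicit starting position. It assumes the confluent-kafka Python client; the group id, topic, and broker address are placeholders.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "offset-inspector",   # hypothetical group id for this sketch
    "enable.auto.commit": False,
})

# Offsets are tracked per partition: here, partition 0 of "example-events".
tp = TopicPartition("example-events", 0)

# Ask the broker which offset this group last committed for that partition.
committed = consumer.committed([tp], timeout=10.0)[0]
print(f"last committed offset on partition 0: {committed.offset}")

# Reprocess from offset 0 by assigning the partition with an explicit
# starting offset (manual assignment, outside group rebalancing).
consumer.assign([TopicPartition("example-events", 0, 0)])
msg = consumer.poll(timeout=5.0)
if msg is not None and not msg.error():
    print(f"replayed offset {msg.offset()}: {msg.value()}")

consumer.close()
```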

Consumers

A consumer is a client application that reads and processes messages from Kafka topics. Consumers subscribe to one or more topics and fetch data in sequential order from their assigned partitions. They typically run as part of a consumer group, where each consumer in the group processes messages from a distinct set of partitions, enabling parallelism and scalability; a KasprApp is itself a consumer group. Consumers use offsets to track which messages have been read, allowing them to resume processing from a specific point after a failure. This flexibility makes Kafka consumers ideal for building event-driven and real-time data processing systems.
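
A minimal consumer-group poll loop might look like the following, again assuming the confluent-kafka Python client (broker address, group id, and topic name are placeholders). Every consumer started with the same group.id joins the same group and shares the topic's partitions.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",   # consumers sharing this id form one group
    "auto.offset.reset": "earliest",  # where to start if no offset is committed
    "enable.auto.commit": False,      # commit explicitly after processing
})
consumer.subscribe(["example-events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
        consumer.commit(msg)  # record progress so a restart resumes after this message
finally:
    consumer.close()
```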

Load Balancing

Load balancing and scale-out are achieved through consumer groups, where multiple consumers collaboratively process messages from a topic. Each partition in the topic is assigned to only one consumer within the group at a time, ensuring that no two consumers handle the same data. If there are more partitions than consumers, some consumers process multiple partitions; if there are more consumers than partitions, some consumers remain idle.

When a consumer joins the group, leaves it, or fails, Kafka triggers a rebalance to reassign partitions among the active consumers. This dynamic assignment enables horizontal scaling and fault tolerance. The rebalancing process ensures efficient load distribution, optimizing throughput while maintaining the order of messages within each partition.
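
Rebalances can be observed with the on_assign and on_revoke callbacks offered by the confluent-kafka Python client. As a sketch (same hypothetical group id and topic as above), starting a second copy of this process triggers a reassignment of partitions between the two instances.

```python
from confluent_kafka import Consumer

def on_assign(consumer, partitions):
    # Called after a rebalance, once this consumer's share of partitions is known.
    print("assigned:", [(p.topic, p.partition) for p in partitions])

def on_revoke(consumer, partitions):
    # Called before partitions are taken away, e.g. when another consumer joins.
    print("revoked:", [(p.topic, p.partition) for p in partitions])

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",   # same group id across all instances
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-events"], on_assign=on_assign, on_revoke=on_revoke)

# Run a second copy of this script and watch partitions move between the
# two instances as Kafka rebalances the group.
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is not None and not msg.error():
        print(f"partition={msg.partition()} offset={msg.offset()}")
```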

Log Compaction

Log compaction in Kafka is a mechanism that retains only the most recent value for each unique key in a topic, while older records with the same key are eventually deleted. This reduces storage usage and keeps topics efficient for use cases like maintaining the current state of a system. Compaction runs in the background, so the topic always holds at least the latest record for every key; records whose keys have no newer update are kept unchanged. Unlike time- or size-based retention, which deletes all messages past a threshold, log compaction preserves the latest state indefinitely, ensuring durability and correctness for stateful applications.
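
Compaction is enabled per topic through the cleanup.policy setting. A sketch using the confluent-kafka admin client follows; the topic name and tuning values are illustrative only.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# "cleanup.policy=compact" keeps only the newest record per key instead of
# deleting by age or size. Producing a key with a null value (a tombstone)
# eventually removes that key from the topic.
compacted = NewTopic(
    "user-profiles",           # hypothetical changelog-style topic
    num_partitions=3,
    replication_factor=3,
    config={
        "cleanup.policy": "compact",
        "min.cleanable.dirty.ratio": "0.5",  # how eagerly the cleaner runs
    },
)
admin.create_topics([compacted])
```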

A KasprTable uses log compaction to ensure table state can be recovered without a large space overhead.