The Year Kafka Grew Up: What Version 4.x Actually Means for Platform Teams

There is a version of the Apache Kafka story that gets told as a series of press releases. ZooKeeper removed. KRaft promoted. Share groups landed. Iceberg everywhere. Each headline lands cleanly, and then platform teams go back to their actual clusters and wonder what any of it means for them.

This post is the other version: what actually happened in the Kafka ecosystem over the past twelve months, why it matters, and what you should be paying attention to heading deeper into 2026.

1. The End of ZooKeeper: What Actually Changed

1.1 The headline

Apache Kafka 4.0 shipped on March 18, 2025, and it removed ZooKeeper support entirely, cutting a dependency that had been part of the architecture for over a decade. This was not a soft deprecation or a future-dated removal notice. It was final.

1.2 Why this took so long

If you have been watching Kafka long enough to remember when KIP-500 was first proposed in 2019, you will appreciate that the journey from “we want to remove ZooKeeper” to “ZooKeeper is gone” took more than five years. That is not because the Kafka team was slow. It is because replacing the consensus and metadata layer of a distributed system that runs production workloads for thousands of organisations is genuinely hard. You do not get to break things.

KIP-500 was introduced as an early access feature in Kafka 2.8.0, released in 2021. Over the following releases, KRaft matured, gained production readiness, and introduced migration features that made it suitable for real-world use. Kafka 3.9 was the designated bridge release, and KRaft mode became the sole implementation for cluster management in Kafka 4.0.

1.3 What KRaft actually gives you

ZooKeeper was an external dependency that required its own quorum, its own monitoring, its own security configuration, and its own operational runbooks. Every engineer who has debugged a Kafka cluster that was technically healthy but fighting with an inconsistent ZooKeeper state will remember the experience fondly.

With KRaft, Kafka now manages its metadata internally in a topic called __cluster_metadata, replicated across a quorum of controller nodes. No more juggling ZooKeeper’s quirks, its distinct configuration syntax, or its resource demands.

The scalability improvement is less talked about but arguably more significant. KRaft can handle far more partitions: think millions rather than hundreds of thousands. Metadata operations such as topic creation or partition reassignment are now appends to the replicated metadata log rather than writes that force a reload of cluster state from ZooKeeper, and controller failover no longer scales with the number of partitions.

For teams running large multi-tenant Kafka clusters, that partition ceiling increase is not a nice-to-have. It is the difference between running one big cluster and running a sprawl of smaller ones purely to stay within ZooKeeper’s limits.

1.4 The upgrade reality

Clusters in ZooKeeper mode must be migrated to KRaft mode before they can be upgraded to 4.0.x. For clusters in KRaft mode with versions older than 3.3.x, upgrading to 3.9.x first is recommended before upgrading to 4.0.x. If your organisation is still sitting on 3.x with ZooKeeper, that migration path is well-documented and production-proven, but it does require sequenced effort. This is not a version bump. It is an architectural transition, and it deserves to be planned accordingly.

2. KRaft at Maturity

2.1 From “good enough” to genuinely better

There is a natural scepticism that greets replacement architectures. The old thing had years of production hardening. The new thing is theoretically cleaner. Platform engineers rightly ask whether cleaner translates to more reliable when your cluster is handling peak load at 3am.

KRaft has passed that test. With KRaft as the only mode, Kafka is simpler to deploy while being more scalable and reliable. The election improvements introduced in 4.0, specifically KIP-996 adding a pre-vote phase to the KRaft quorum and KIP-966 adding Eligible Leader Replicas to improve data consistency during partition leader elections, are not theoretical. They directly reduce the blast radius of the failovers and elections that, in ZooKeeper-based clusters, could cascade into extended unavailability.

2.2 Dynamic quorums change operations

Support for dynamic KRaft quorums makes adding or removing controller nodes without downtime a much simpler process. This is a significant operational improvement. In static quorum configurations, changing controller membership was a procedure that required careful sequencing and carried real risk. Dynamic quorums bring controller maintenance closer to the routine operations teams already handle for brokers.
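
One practical consequence is that quorum membership is now observable through the standard Admin API rather than through ZooKeeper tooling. Here is a hedged sketch of a post-change health check in Java (the bootstrap address is a placeholder; describeMetadataQuorum has been part of the Java Admin client since Kafka 3.3):

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class DescribeControllerQuorum {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Describe the KRaft controller quorum: current leader, voters, and observers.
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Controller leader id: " + quorum.leaderId());
            quorum.voters().forEach(v ->
                System.out.println("Voter " + v.replicaId() + " logEndOffset=" + v.logEndOffset()));
            quorum.observers().forEach(o ->
                System.out.println("Observer " + o.replicaId() + " logEndOffset=" + o.logEndOffset()));
        }
    }
}
```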

2.3 Simplified configuration management

The default properties files for KRaft mode are no longer stored in a separate config/kraft directory, since ZooKeeper has been removed; they have been consolidated with the other configuration files. Small detail, but it matters. The previous split directory structure was a constant reminder that Kafka and ZooKeeper were two loosely coupled systems. Consolidation signals that KRaft is not a mode. It is Kafka.

3. Share Groups and the Queue Semantics Debate

3.1 What KIP-932 actually introduces

KIP-932 introduces Queues for Kafka: share groups that enable cooperative consumption on regular Kafka topics, effectively giving Kafka traditional queue semantics.

The specific capabilities are worth enumerating precisely because the marketing summary elides the nuance. Queues allow multiple consumers to cooperatively read records from the same partition. Records are still consumed by a single consumer in the share group and can be acknowledged individually. That last point is critical. Individual message acknowledgment means failed messages can be redelivered without blocking the partition. This is the behaviour that teams currently implement by hand through dead letter queues, retry topics, and custom offset management.
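
To make that consumption model concrete, here is a minimal sketch of a share-group worker using the KafkaShareConsumer API that shipped with the preview. Treat it as illustrative: the API may still change before general availability, the bootstrap address, group name, and topic are placeholders, and share groups must be enabled on the brokers first.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.AcknowledgeType;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaShareConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentTaskWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // placeholder
        props.put("group.id", "payment-tasks");                     // the share group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
            consumer.subscribe(List.of("payment-tasks"));            // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        process(record);                             // your business logic
                        consumer.acknowledge(record, AcknowledgeType.ACCEPT);
                    } catch (Exception e) {
                        // Release the record so another consumer in the share group can retry it,
                        // without blocking the rest of the partition.
                        consumer.acknowledge(record, AcknowledgeType.RELEASE);
                    }
                }
                consumer.commitSync();                               // commit the acknowledgements
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```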

3.2 The maturity timeline

Queues for Kafka shipped in early access in 4.0, reached preview in 4.1 after steady improvement, and is planned to be marked production ready in 4.2. That trajectory suggests general availability in 2026, most likely in the first half.

Kafka 4.1, released September 2, 2025, brought Queues for Kafka into preview state through KIP-932, along with a new Streams Rebalance Protocol in early access through KIP-1071.

3.3 Why this matters strategically

Teams who chose Kafka over RabbitMQ or SQS typically did so for its durability guarantees, its replay capability, and its throughput characteristics. What they lost was the simpler consumer model where any idle worker picks up the next available message without complex partition assignment.

Share groups close that gap without abandoning Kafka’s fundamental model. A single platform team running Kafka can now support event streaming workloads, stream processing workloads, and queue-style task distribution workloads on the same infrastructure. The operational simplification alone justifies close attention.

3.4 KIP-848: The rebalance protocol that makes it possible

KIP-848, the next generation consumer group protocol, is the foundation that share groups build on. It addressed three long-standing problems: full-group stop-the-world rebalances on membership changes, ephemeral member IDs tied to heartbeat status, and brittle client-side assignment. It resolves them with incremental assignments, durable member IDs, and broker-side coordination, making consumer groups faster and more reliable.

The next generation consumer rebalance protocol is generally available in Apache Kafka 4.0. The protocol is automatically enabled on the server when the upgrade to 4.0 is finalised. Clients opt in by setting group.protocol=consumer, as shown in the sketch below.
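
A hedged example of that opt-in on a plain Java consumer (bootstrap address, group id, and topic are placeholders; everything else about the consumer loop is unchanged):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class Kip848Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Opt in to the KIP-848 broker-coordinated rebalance protocol (GA in 4.0).
        // Client-side assignor settings such as partition.assignment.strategy no longer
        // apply under the new protocol; assignment is computed on the broker.
        props.put("group.protocol", "consumer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                            // placeholder topic
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                    System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset()));
            }
        }
    }
}
```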

For large-scale deployments, the elimination of stop-the-world rebalances is not a marginal improvement. It is the difference between consumer groups that are resilient to membership changes at scale and consumer groups that become a reliability liability as they grow.

4. Diskless and Cloud Native

4.1 The structural problem KIP-500 did not solve

ZooKeeper removal cleaned up the metadata plane. It did not touch the most expensive part of running Kafka in the cloud: the data plane. Traditional Kafka brokers are stateful. They maintain local disk storage for their partition logs. They replicate across availability zones, paying cross-AZ data transfer costs on every write. They over-provision compute capacity because resizing means moving data, which is slow and operationally risky.

Those costs accumulate at scale. The industry response has been both a series of competing Kafka Improvement Proposals and a generation of Kafka-compatible startups built on fundamentally different storage architectures.

4.2 Tiered storage as the first step

Tiered storage, introduced in early access in Kafka 3.6 and declared production ready in 3.9, allows Kafka to offload older log segments to object storage while keeping recent data on local broker disks. This reduces storage costs without changing the core write path. It is the pragmatic middle ground, and it is now widely deployed.
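
Opting an individual topic into tiered storage is a per-topic configuration. A rough sketch using the Java Admin client, assuming the cluster already has remote storage and a RemoteStorageManager plugin enabled broker-side (topic name, partition count, and retention values are placeholders):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("clickstream", 12, (short) 3)           // placeholders
                .configs(Map.of(
                    "remote.storage.enable", "true",   // offload closed segments to object storage
                    "local.retention.ms", "3600000",   // keep roughly 1 hour on broker-local disk
                    "retention.ms", "2592000000"       // keep roughly 30 days overall (local + remote)
                ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```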

The limitation is that tiered storage only addresses cold data. The active write path still carries the full cost of inter-AZ replication, and brokers remain stateful for the segments they hold locally.

4.3 The KIP-1150 diskless proposal

KIP-1150, known as Diskless Topics, is a major proposal to re-architect how Kafka handles data in the cloud. It proposes allowing topics to store their data directly in object storage instead of on broker-local disks.

The proposal introduces a leaderless architecture where any broker can write data to a shared object store, bypassing the traditional replication process and its associated costs. To optimise writes to the remote storage system, brokers can group records from different topic-partitions into an object called a shared log segment object. A new coordination layer is then used to retrieve specific records from these objects.

Eliminating inter-AZ replication traffic is the core economic argument. For clusters with meaningful throughput spread across multiple availability zones, that traffic is often the dominant cost line.

4.4 The contested path to standardisation

The Kafka community finds itself at a fork in the road with three KIPs simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones: KIP-1150, KIP-1176, and KIP-1183.

The good news from late 2025 is that community consolidation has begun. Slack announced its intention to withdraw KIP-1176 and contribute to KIP-1150 instead, reducing fragmentation risk. Whether the community converges on a single approach in time for a Kafka 5.x release or whether this extends further is genuinely uncertain. What is certain is that the direction is set. Kafka brokers will eventually stop owning the storage layer.

4.5 The commercial implementations that already exist

While the community debates the right open source path, several production implementations already run on fully disaggregated storage architectures. WarpStream, AutoMQ, and others offer Kafka-compatible services built entirely on object storage.

AutoMQ adopts a different architecture: it decouples storage from compute, offloading data to EBS and S3 while maintaining full Kafka compatibility without compromising on latency. Confluent has implemented storage-compute separation within its serverless Confluent Cloud, with some cases showing up to 90% cost reduction compared to traditional clusters.

For organisations making infrastructure decisions now, the practical question is not whether diskless Kafka will exist in open source. It is whether the cost savings justify moving to a commercial implementation ahead of upstream standardisation, accepting either vendor lock-in or protocol compatibility risk as the trade.

5. Apache Iceberg

5.1 Why Iceberg keeps appearing in Kafka conversations

Apache Iceberg is a table format for large analytic datasets, not a streaming system. But it has become inescapable in Kafka ecosystem discussions because it solves a problem that every organisation with a Kafka cluster eventually faces: how do you make streaming data queryable without building and maintaining a custom ETL pipeline?

In 2025, Confluent Cloud Tableflow went generally available for Iceberg, WarpStream released their own Tableflow equivalent, and Aiven released open-source Iceberg Topics. The pattern is consistent across vendors: surface Kafka topics as Iceberg tables without requiring users to build the conversion infrastructure themselves.

5.2 The Confluent Tableflow approach

Tableflow, which became generally available in 2025, converts Kafka topics to Iceberg tables automatically. Data engineers can query those topics with standard Iceberg-compatible engines such as DuckDB, Apache Spark, Trino, and Snowflake, without the topic data ever having to leave the streaming layer. The streaming system and the analytical system share the same underlying data.
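
What this looks like from the consuming side depends on the engine, but most of these integrations expose topics through an Iceberg REST catalog. Here is a hedged sketch using the Iceberg Java API, assuming such a catalog endpoint exists for your deployment; the URI, credential, namespace, and table name are placeholders, and the exact catalog properties are vendor-specific.

```java
import java.util.Map;

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.rest.RESTCatalog;

public class ReadTopicTable {
    public static void main(String[] args) throws Exception {
        // Placeholder catalog settings; real values are vendor- and deployment-specific.
        RESTCatalog catalog = new RESTCatalog();
        catalog.initialize("kafka-tables", Map.of(
            "uri", "https://example.invalid/iceberg",    // REST catalog endpoint (placeholder)
            "credential", "client-id:client-secret"      // auth, if the catalog requires it (placeholder)
        ));

        // A topic surfaced as an Iceberg table, e.g. namespace "payments", table "transactions".
        Table table = catalog.loadTable(TableIdentifier.of("payments", "transactions"));

        // Scan the table with the generic reader; any Iceberg-compatible engine could do the same.
        try (CloseableIterable<Record> rows = IcebergGenerics.read(table).build()) {
            for (Record row : rows) {
                System.out.println(row);
            }
        }
        catalog.close();
    }
}
```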

5.3 The cost and complexity of conversion

Generating Parquet files, the underlying format for Iceberg tables, is computationally expensive. Compared to copying a log segment from local disk to object storage, it uses at least an order of magnitude more CPU cycles and a significant amount of memory. That would be fine if the work ran on a disposable stateless compute node, but it runs on one of the most important components you operate: a Kafka broker that is the leader for some of the topic-partitions in your cluster.

That trade-off is real and worth being honest about. Direct Iceberg conversion on the broker adds compute load to exactly the components that should be focused on reliable message delivery. Organisations evaluating Iceberg-native Kafka features should test the conversion overhead against their actual topic throughput, not assume the feature is operationally free.

5.4 What Iceberg adoption actually means for architecture

The broader implication is architectural. Historically, organisations maintained a streaming layer and a data warehouse or lake layer as separate systems, connected by batch ETL jobs or streaming connectors. Iceberg, combined with the current generation of Kafka implementations, is collapsing that boundary. The streaming layer becomes the table layer. Downstream consumers, whether analysts running SQL or ML pipelines reading Parquet, access the same data without a coordination step in between.

For banking and financial services specifically, where regulatory requirements demand audit trails and the ability to replay historical data, Iceberg topics offer a compelling combination: low latency streaming semantics for operational systems and high-throughput analytical query access for compliance and reporting, from the same dataset.

6. Kafka 4.x: Release Cadence and What Is Coming

6.1 The 2025 release picture

Kafka 4.0 shipped in March 2025, removing ZooKeeper and delivering KIP-848 as GA. Kafka 4.1 shipped in September 2025, promoting KIP-932 share groups to preview and introducing the new Streams Rebalance Protocol in early access. Kafka 4.2 was in development by late 2025.

The plan is to mark Queues for Kafka production ready in 4.2. That makes Kafka 4.2 the release platform teams running task distribution workloads should be watching most closely.

6.2 The Java requirement shift

Kafka 4.0 dropped support for Java 8. Clients and Streams now require Java 11, while brokers, tools, and Connect require Java 17. This is not theoretical compatibility noise: organisations running Kafka brokers on Java 11 need to move to Java 17 before upgrading to 4.0 and above, because the brokers will not start on older JVM versions. It is a forcing function for JVM standardisation that some infrastructure teams will experience as unwelcome, but it is ultimately the right move for long-term supportability.

6.3 API compatibility boundary

Kafka 4.0 only supports KRaft mode, and old protocol API versions have been removed. Users should ensure brokers are version 2.1 or higher before upgrading Java clients to 4.0. Similarly, users should ensure their Java client version is 2.1 or higher before upgrading brokers to 4.0.

The backward compatibility window has been formally shortened. Kafka 2.1 is now the baseline. Clients older than that will not connect to Kafka 4.x brokers. For organisations with heterogeneous client deployments, including legacy applications with embedded Kafka clients, an audit of client library versions is a prerequisite for any 4.x upgrade planning.

7. The Operator Perspective: What to Prioritise in 2026

The Kafka ecosystem in 2026 is richer and more complex than it has ever been. That is good for organisations that can absorb and apply the changes. It is a liability for teams that try to follow everything simultaneously.

The practical prioritisation for most platform teams is:

Immediate: ZooKeeper migration if not already complete. Kafka 3.9 remains supported but is the last release to support ZooKeeper. Running ZooKeeper-based clusters means running against an architecture the community has formally closed. The migration tooling is mature. The risk of delay is accumulating technical debt against a hard deadline.

Near term: KIP-848 client adoption. The new consumer group protocol is enabled on brokers in Kafka 4.0 but clients must opt in. Consumer teams that update their configuration to use group.protocol=consumer will gain the stability benefits of incremental rebalances. The cost of not doing so is continuing to take stop-the-world rebalances that the protocol was specifically designed to eliminate.

Medium term: Evaluate share groups in 4.2. The preview status in 4.1 is the right time to prototype workloads that currently use separate queuing infrastructure. When 4.2 brings GA status, organisations that have already tested share groups against their use cases will be positioned to consolidate faster.

Strategic: Watch KIP-1150 and the diskless architecture consolidation. This is a decision point that will not require immediate action but will have significant infrastructure cost implications over a two to three year horizon. Organisations making cloud infrastructure investments now should ensure their Kafka deployment architecture does not foreclose the options that diskless brokers will enable.

Ongoing: Evaluate Iceberg integration against actual query patterns. Iceberg topics are compelling, but the compute overhead of conversion is real. Pilot the feature against production topic throughput before committing to it as a platform-wide pattern.

Closing Observation

Kafka’s 2025 story is not primarily about any single feature. It is about a platform that has been methodically resolving the architectural compromises it made under the constraint of early-stage distributed systems thinking, while simultaneously facing a generation of cloud-native competitors built without those constraints.

The ZooKeeper removal closes a chapter that should have closed sooner but could not close safely until it did. KRaft’s maturity delivers the architectural simplicity that always made conceptual sense. Share groups extend Kafka’s relevance to workloads that previously required a second queuing system. And the diskless architecture debate, contested as it is, points toward a future where the cost of operating Kafka at scale declines materially.

For engineering leaders, the consistent signal is that investment in Kafka expertise remains well placed. The platform is maturing in the right directions. The question is execution: how quickly can your organisation absorb the changes that are already available and position itself for the ones that are still landing.

Andrew Baker is Chief Information Officer at Capitec Bank. The views expressed here are personal and do not represent Capitec Bank or its technology strategy.
