The Enterprise Service Bus (ESB) once promised to be the silver bullet for enterprise integration. Organizations invested millions in platforms like MuleSoft, IBM Integration Bus, Oracle Service Bus, and TIBCO BusinessWorks, believing they would solve all their integration challenges. Today, these same organizations are discovering that their ESB has become their biggest architectural liability.
The rise of Apache Kafka, Spring Boot, and microservices architecture represents more than just a technological shift. It represents a fundamental rethinking of how we build scalable, resilient systems. This article examines why ESBs are dying, how they actively harm businesses, and why the combination of Java, Spring, and Kafka provides a superior alternative.
2. The False Promise of the ESB
Enterprise Service Buses emerged in the early 2000s as a solution to point-to-point integration chaos. The pitch was compelling: a single, centralized platform that would mediate all communication between systems, apply transformations, enforce governance, and provide a unified integration layer.
The reality turned out very differently. What organizations got instead was a monolithic bottleneck that became increasingly difficult to change, scale, or maintain. The ESB became the very problem it was meant to solve.
3. How ESBs Kill Business Velocity
3.1. The Release Coordination Nightmare
Every change to an ESB requires coordination across multiple teams. Want to update an endpoint? You need to test every flow that might be affected. Need to add a new integration? You risk breaking existing integrations. The ESB becomes a coordination bottleneck where release cycles stretch from days to weeks or even months.
In a Kafka and microservices architecture, services are independently deployable. Teams can release changes to their own services without coordinating with dozens of other teams. A payment service can be updated without touching the order service, the inventory service, or any other component. This independence translates directly to business velocity.
3.2. The Scaling Ceiling
ESBs scale primarily vertically. When you hit performance limits, you buy bigger hardware or bolt on clustering, which introduces complexity and cost. More critically, you hit hard limits. There is only so far you can scale a monolithic integration platform.
Kafka was designed for horizontal scaling from day one. Need more throughput? Add more brokers. Need to handle more consumers? Add more consumer instances. A single Kafka cluster can handle millions of messages per second across hundreds of nodes. This is not theoretical scaling. This is proven at companies like LinkedIn, Netflix, and Uber handling trillions of events daily.
3.3. The Single Point of Failure Problem
An ESB is a single critical service that everything depends on. When it goes down, your entire business grinds to a halt. Payments stop processing. Orders cannot be placed. Customer requests fail. The blast radius of an ESB failure is catastrophic.
With Kafka and microservices, failure is isolated. If one microservice fails, it affects only that service’s functionality. Kafka itself is distributed and fault tolerant. With proper replication settings, you can lose entire brokers without losing data or availability. The architecture is resilient by design, not by hoping your single ESB cluster stays up.
4. The Technical Debt Trap
4.1. Upgrade Hell
ESB upgrades are terrifying events. You are upgrading a platform that mediates potentially hundreds of integrations. Testing requires validating every single flow. Rollback is complicated or impossible. Organizations commonly run ESB versions that are years out of date because the risk and effort of upgrading is too high.
Spring Boot applications follow standard semantic versioning and upgrade paths. Kafka upgrades are rolling upgrades with backward compatibility guarantees. You upgrade one service at a time, one broker at a time. The risk is contained. The effort is manageable.
4.2. Vendor Lock-In
ESB platforms come with proprietary development tools, proprietary languages, and proprietary deployment models. Your integration logic is written in vendor-specific formats that cannot be easily migrated. When you want to leave, you face rewriting everything from scratch.
Kafka is open source. Spring is open source. Java is a standard. Your code is portable. Your skills are transferable. You are not locked into a single vendor’s roadmap or pricing model.
4.3. The Talent Problem
Finding developers who want to work with ESB platforms is increasingly difficult. The best engineers want to work with modern technologies, not proprietary integration platforms. ESB skills are legacy skills. Kafka and Spring skills are in high demand.
This talent gap creates a vicious cycle. Your ESB becomes harder to maintain because you cannot hire good people to work on it. The people you do have become increasingly specialized in a dying technology, making it even harder to transition away.
5. The Pitfalls That Kill ESBs
5.1. Message Poisoning
A single malformed message can crash an ESB flow. Worse, that message can sit in a queue or topic, repeatedly crashing the flow every time it is processed. The ESB lacks sophisticated dead-letter queue handling, lacks proper message validation frameworks, and lacks the observability to quickly identify and fix poison message problems.
Kafka with Spring Kafka provides robust error handling. Dead-letter topics are first-class concepts. You can configure retry policies, error handlers, and message filtering at the consumer level. When poison messages occur, they are isolated and can be processed separately without bringing down your entire integration layer.
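As an illustration, a minimal Spring Kafka error-handling setup might look like the sketch below. It retries a failed record a few times and then routes it to a dead-letter topic; the bean wiring and retry values are illustrative rather than prescriptive.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaErrorHandlingConfig {

    // Failed records are retried up to 3 times, one second apart, then published
    // to the default dead-letter topic ("<originalTopic>.DLT").
    // Spring Boot picks this handler up for @KafkaListener containers.
    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
    }
}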
5.2. Resource Contention
All integrations share the same ESB resources. A poorly performing transformation or a high-volume integration can starve other integrations of CPU, memory, or thread pool resources. You cannot isolate workloads effectively.
Microservices run in isolated containers with dedicated resources. Kubernetes provides resource quotas, limits, and quality-of-service guarantees. One service consuming excessive resources does not impact others. You can scale services independently based on their specific needs.
5.3. Configuration Complexity
ESB configurations grow into sprawling XML files or proprietary configuration formats with thousands of lines. Understanding the full impact of a change requires expert knowledge of the entire configuration. Documentation falls out of date. Tribal knowledge becomes critical.
Spring Boot uses convention over configuration with sensible defaults. Kafka configuration is straightforward properties files. Infrastructure-as-code tools like Terraform and Helm manage deployment configurations in version-controlled, testable formats. Complexity is managed through modularity, not through ever-growing monolithic configurations.
5.4. Lack of Elasticity
ESBs cannot auto-scale based on load. You provision for peak capacity and waste resources during normal operation. When unexpected load hits, you cannot quickly add capacity. Manual intervention is required, and by the time you scale up, you have already experienced an outage.
Kubernetes Horizontal Pod Autoscaler can scale microservices based on CPU, memory, or custom metrics like message lag. Kafka consumer groups automatically rebalance when you add or remove instances. The system adapts to load automatically, scaling up during peaks and scaling down during quiet periods.
6. The Java, Spring, and Kafka Alternative
6.1. Modern Java Performance
Java 25 represents the cutting edge of JVM performance and developer productivity. Virtual threads, now mature and production-hardened, enable massive concurrency with minimal resource overhead. The low-pause garbage collectors, ZGC and Shenandoah, keep GC pauses in the sub-millisecond to low-millisecond range even on multi-terabyte heaps, making Java competitive with languages that traditionally claimed performance advantages.
The ahead-of-time cache lets the JVM reuse class loading, linking, and profiling work from a training run, dramatically reducing startup and warmup time. Java microservices can start in a fraction of the time they once needed, changing deployment dynamics in containerized environments.
This is not incremental improvement. Java 25 represents a generational leap in performance, efficiency, and developer experience that makes it the ideal foundation for high-throughput microservices.
6.2. Spring Boot Productivity
Spring Boot eliminates boilerplate. Auto-configuration sets up your application with sensible defaults. Spring Kafka provides high-level abstractions over Kafka consumers and producers. Spring Cloud Stream enables event-driven microservices with minimal code.
A complete Kafka consumer microservice can be written in under 100 lines of code. Testing is straightforward with embedded Kafka. Observability comes built in with Micrometer metrics and distributed tracing support.
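For illustration, a minimal sketch of such a service is shown below. The topic, group id, and payload handling are placeholders; broker addresses and serializers would come from standard spring.kafka.* properties.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class OrderEventsApplication {
    public static void main(String[] args) {
        SpringApplication.run(OrderEventsApplication.class, args);
    }
}

@Component
class OrderEventsListener {

    // Auto-configured from spring.kafka.* properties; no XML flows, no central bus.
    @KafkaListener(topics = "orders", groupId = "order-enrichment")
    void onOrderEvent(String payload) {
        // Transform, enrich, or forward the event close to the data
        System.out.println("Received order event: " + payload);
    }
}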
6.3. Kafka as the Integration Backbone
Kafka is not just a message broker. It is a distributed commit log that provides durable, ordered, replayable streams of events. This fundamentally changes how you think about integration.
With Kafka 4.2, the platform has evolved even further by introducing native queue support alongside its traditional topic-based architecture. This means you can now implement classic queue semantics with competing consumers for workload distribution while still benefiting from Kafka’s durability, scalability, and operational simplicity. Organizations no longer need separate queue infrastructure for point-to-point messaging patterns.
Instead of request-response patterns mediated by an ESB, you have event streams that services can consume at their own pace. Instead of transformations happening in a central layer, transformations happen in microservices close to the data. Instead of a single integration layer, you have a distributed data platform that handles both streaming and queuing workloads.
7. Real-World Patterns
7.1. Event Sourcing
Store every state change as an event in Kafka. Your services consume these events to build their own views of the data. You get complete audit trails, temporal queries, and the ability to rebuild state by replaying events.
ESBs cannot do this. They are designed for transient message passing, not durable event storage.
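A minimal sketch of the write side, assuming a String-serialized KafkaTemplate and a hypothetical account-events topic: every state change is appended as an event keyed by the aggregate id, so all events for one aggregate stay ordered, and consumers can rebuild state by replaying the topic from the beginning.

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
class AccountEventStore {

    private final KafkaTemplate<String, String> kafka;

    AccountEventStore(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    // Append an immutable event; keying by aggregate id keeps per-account ordering
    // within a partition, and the topic itself becomes the durable event log.
    void append(String accountId, String eventJson) {
        kafka.send("account-events", accountId, eventJson);
    }
}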
7.2. Change Data Capture
Use tools like Debezium to capture database changes and stream them to Kafka. Your microservices react to these change events without complex database triggers or polling. You get near real-time data pipelines without the fragility of ESB database adapters.
7.3. Saga Patterns
Implement distributed transactions using choreography or orchestration patterns with Kafka. Each service publishes events about its local transactions. Other services react to these events to complete their portion of the saga. You get eventual consistency without distributed locks or two-phase commit.
ESBs attempt to solve this with BPEL or proprietary orchestration engines that become unmaintainable complexity.
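A choreography-style saga step might look like the following sketch (topic names and the payment call are hypothetical): the service consumes the previous step's event, runs its local transaction, and publishes either a success event or a compensating event for the next participant.

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
class PaymentSagaStep {

    private final KafkaTemplate<String, String> kafka;

    PaymentSagaStep(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    // React to the previous step, perform the local transaction, publish the outcome.
    @KafkaListener(topics = "order-created", groupId = "payment-service")
    void onOrderCreated(String orderJson) {
        boolean charged = chargeCustomer(orderJson);   // local transaction
        String nextTopic = charged ? "payment-completed" : "payment-failed";
        kafka.send(nextTopic, orderJson);
    }

    private boolean chargeCustomer(String orderJson) {
        return true; // placeholder for the real payment call
    }
}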
7.4. Work Queue Distribution
With Kafka 4.2’s native queue support, you can implement traditional work-queue patterns where tasks are distributed among competing consumers. This is perfect for batch processing, background jobs, and task distribution scenarios that previously required separate queue infrastructure like RabbitMQ or ActiveMQ. Now you get queue semantics with Kafka’s operational benefits.
8. The Migration Path
8.1. Strangler Fig Pattern
You do not need to rip out your ESB overnight. Apply the strangler fig pattern. Identify new integrations or integrations that need significant changes. Implement these as microservices with Kafka instead of ESB flows. Gradually migrate existing integrations as they require updates.
Over time, the ESB shrinks while your Kafka ecosystem grows. Eventually, the ESB becomes small enough to eliminate entirely.
8.2. Event Gateway
Deploy a Kafka-to-ESB bridge for transition periods. Services publish events to Kafka. The bridge consumes these events and forwards them to ESB endpoints where necessary. This allows new services to be built on Kafka while maintaining compatibility with legacy ESB integrations.
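A minimal bridge can be as simple as the sketch below, assuming a legacy ESB HTTP endpoint and a customer-events topic (both hypothetical): new services publish to Kafka, and the bridge replays those events into the old flow until it can be retired.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
class EsbBridge {

    private final HttpClient http = HttpClient.newHttpClient();

    // Transitional adapter: consume new-world events and forward them to a legacy ESB endpoint.
    @KafkaListener(topics = "customer-events", groupId = "esb-bridge")
    void forward(String eventJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://legacy-esb/customer/inbound"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(eventJson))
                .build();
        http.send(request, HttpResponse.BodyHandlers.discarding());
    }
}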
8.3. Invest in Platform Engineering
Build internal platforms and tooling around your Kafka and microservices architecture. Provide templates, generators, and golden-path patterns that make it easier to build microservices correctly than to add another ESB flow.
Platform engineering accelerates the migration by making the right way the easy way.
9. The Cost Reality
Organizations often justify ESBs based on licensing costs versus building custom integrations. This analysis is fundamentally flawed.
ESB licenses are expensive, but that is just the beginning. Add the cost of specialized consultants. Add the cost of extended release cycles. Add the opportunity cost of features not delivered because teams are blocked on ESB changes. Add the cost of outages when the ESB fails.
Kafka is open source with zero licensing costs. Spring is open source. Java is free. The tooling ecosystem is mature and open source. Your costs shift from licensing to engineering time, but that engineering time produces assets you own and can evolve without vendor dependency.
More critically, the business velocity enabled by microservices and Kafka translates directly to revenue. Features ship faster. Systems scale to meet demand. You capture opportunities that ESB architectures would have missed.
10. Conclusion
The ESB is a relic of an era when centralization seemed like the answer to complexity. We now know that centralization creates brittleness, bottlenecks, and business risk.
Kafka and microservices represent a fundamentally better approach. Distributed ownership, independent scalability, fault isolation, and evolutionary architecture are not just technical benefits. They are business imperatives in a world where velocity and resilience determine winners and losers.
The question is not whether to move away from ESBs. The question is how quickly you can execute that transition before your ESB becomes an existential business risk. Every day you remain on an ESB architecture is a day your competitors gain ground with more agile, scalable systems.
The death of the ESB is not a tragedy. It is an opportunity to build systems that actually work at the scale and pace modern business demands. Java, Spring, and Kafka provide the foundation for that future. The only question is whether you will embrace it before it is too late.
Java’s concurrency model has undergone a revolutionary transformation with the introduction of Virtual Threads in Java 19 (as a preview feature) and their stabilization in Java 21. With Java 25, virtual threads have reached new levels of maturity by addressing critical pinning issues that previously limited their effectiveness. This article explores the evolution of threading models in Java, the problems virtual threads solve, and how Java 25 has refined this powerful concurrency primitive.
Virtual threads represent a paradigm shift in how we write concurrent Java applications. They enable the traditional thread per request model to scale to millions of concurrent operations without the resource overhead that plagued platform threads. Understanding virtual threads is essential for modern Java developers building high throughput, scalable applications.
2. The Problem with Traditional Platform Threads
2.1. Platform Thread Architecture
Platform threads (also called OS threads or kernel threads) are the traditional concurrency mechanism in Java. Each java.lang.Thread is a thin wrapper around an operating system thread: creating one asks the OS for a kernel thread, and that kernel thread is held for the entire lifetime of the task.
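For example, the classic ways to obtain such a thread look like this (fetchDataFromDatabase and processData are placeholders for blocking work):

// Each of these creates and holds a dedicated OS thread for the task's lifetime.
Thread worker = new Thread(() -> {
    String data = fetchDataFromDatabase();   // while this blocks, the OS thread sits idle
    processData(data);
});
worker.start();

// Java 21+ spells the distinction out explicitly:
Thread platform = Thread.ofPlatform()
        .name("worker-1")
        .start(() -> processData(fetchDataFromDatabase()));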
2.2. Resource Constraints
Platform threads are expensive resources:
Memory Overhead: Each platform thread requires a stack (typically 1MB by default), which means 1,000 threads consume approximately 1GB of memory just for stacks.
Context Switching Cost: The OS scheduler must perform context switches between threads, saving and restoring CPU registers, memory mappings, and other state.
Limited Scalability: Creating tens of thousands of platform threads leads to:
Memory exhaustion
Increased context switching overhead
CPU cache thrashing
Scheduler contention
2.3. The Thread Pool Pattern and Its Limitations
To manage these constraints, developers traditionally use thread pools:
ExecutorService executor = Executors.newFixedThreadPool(200);
// Submit tasks to the pool
for (int i = 0; i < 10000; i++) {
executor.submit(() -> {
// Perform I/O operation
String data = fetchDataFromDatabase();
processData(data);
});
}
Problems with Thread Pools:
Task Queuing: With limited threads, tasks queue up waiting for available threads
Resource Underutilization: Threads blocked on I/O waste CPU time
Complexity: Tuning pool sizes becomes an art form
Poor Observability: Stack traces don’t reflect actual application structure
Thread Pool (Size: 4)
┌──────┬──────┬──────┬──────┐
│Thread│Thread│Thread│Thread│
│ 1 │ 2 │ 3 │ 4 │
│BLOCK │BLOCK │BLOCK │BLOCK │
└──────┴──────┴──────┴──────┘
↑
All threads blocked on I/O
Task Queue: [Task5, Task6, Task7, ..., Task1000]
↑
Waiting for available thread
2.4. The Reactive Programming Alternative
To avoid blocking threads, reactive programming (RxJava, Project Reactor) emerged as an alternative. It solves the thread-blocking problem, but at a cost:
Steep Learning Curve: Requires understanding operators like flatMap, zip, merge
Difficult Debugging: Stack traces are fragmented and hard to follow
Imperative to Declarative: Forces a complete mental model shift
Library Compatibility: Not all libraries support reactive patterns
Error Handling: Becomes significantly more complex
3. Enter Virtual Threads: Lightweight Concurrency
3.1. The Virtual Thread Concept
Virtual threads are lightweight threads managed by the JVM rather than the operating system. They enable the thread per task programming model to scale:
Key Characteristics:
Cheap to Create: Creating a virtual thread takes microseconds and minimal memory
JVM Managed: The JVM scheduler multiplexes virtual threads onto a small pool of OS threads (carrier threads)
Blocking is Fine: When a virtual thread blocks on I/O, the JVM unmounts it from its carrier thread
Millions Scale: You can create millions of virtual threads without exhausting memory
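Creating them is deliberately simple. A quick sketch of the common entry points (handleRequest is a placeholder for blocking work):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadCreation {
    public static void main(String[] args) throws InterruptedException {
        // 1. Start a single virtual thread directly
        Thread vt = Thread.ofVirtual().name("request-1").start(() -> handleRequest());
        vt.join();

        // 2. Convenience factory method
        Thread.startVirtualThread(() -> handleRequest()).join();

        // 3. One virtual thread per submitted task
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> handleRequest());
            }
        }   // close() waits for submitted tasks to finish
    }

    private static void handleRequest() {
        // Placeholder for blocking work (I/O, database call, etc.)
    }
}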
3.2. How Virtual Threads Work Under the Hood
Virtual threads are built on a mechanism called continuations. When a virtual thread performs a blocking operation, the runtime does the following:
A virtual thread begins executing on some carrier (an OS thread under the hood), as though it were a normal thread.
When it hits a blocking operation (I/O, sleep, etc), the runtime arranges to save where it is (its stack frames, locals) into a continuation object (or the equivalent mechanism).
That carrier thread is released (so it can run other virtual threads) while the virtual thread is waiting.
Later when the blocking completes / the virtual thread is ready to resume, the continuation is scheduled on some carrier thread, its state restored and execution continues.
A simplified conceptual model looks like this:
// Simplified conceptual representation
class VirtualThread {
Continuation continuation;
Object mountedCarrierThread;
void park() {
// Save execution state
continuation.yield();
// Unmount from carrier thread
mountedCarrierThread = null;
}
void unpark() {
// Find available carrier thread
mountedCarrierThread = getAvailableCarrier();
// Restore execution state
continuation.run();
}
}
This example shows how virtual threads simplify server design by allowing each incoming HTTP request to be handled in its own virtual thread, just like the classic thread-per-request model—only now it scales.
The code below creates an executor that launches a new virtual thread for every request. Inside that thread, the handler performs blocking I/O (reading the request and writing the response) in a natural, linear style. There’s no need for callbacks, reactive chains, or custom thread pools, because blocking no longer ties up an OS thread.
Each request runs independently, errors are isolated, and the system can support a very large number of concurrent connections thanks to the low cost of virtual threads.
The new virtual thread version is dramatically simpler because it uses plain blocking code without threadpool tuning, callback handlers, or complex asynchronous frameworks.
// Traditional Platform Thread Approach
public class PlatformThreadServer {
private static final ExecutorService executor =
Executors.newFixedThreadPool(200);
public void handleRequest(HttpRequest request) {
executor.submit(() -> {
try {
// Simulate database query (blocking I/O)
Thread.sleep(100);
String data = queryDatabase(request);
// Simulate external API call (blocking I/O)
Thread.sleep(50);
String apiResult = callExternalApi(data);
sendResponse(apiResult);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
}
}
// Virtual Thread Approach
public class VirtualThreadServer {
private static final ExecutorService executor =
Executors.newVirtualThreadPerTaskExecutor();
public void handleRequest(HttpRequest request) {
executor.submit(() -> {
try {
// Same blocking code, but now scalable!
Thread.sleep(100);
String data = queryDatabase(request);
Thread.sleep(50);
String apiResult = callExternalApi(data);
sendResponse(apiResult);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
}
}
Performance Comparison:
Platform Thread Server (200 thread pool):
- Max concurrent requests: ~200
- Memory overhead: ~200MB (thread stacks)
- Throughput: Limited by pool size
Virtual Thread Server:
- Max concurrent requests: ~1,000,000+
- Memory overhead: ~1MB per 1000 threads
- Throughput: Limited by available I/O resources
4.4. Structured Concurrency
Traditional Java concurrency makes it easy to start threads but hard to control their lifecycle. Tasks can outlive the method that created them, failures get lost, and background work becomes difficult to reason about.
Structured concurrency fixes this by enforcing a simple rule:
tasks started in a scope must finish before the scope exits.
This gives you predictable ownership, automatic cleanup, and reliable error propagation.
With virtual threads, this model finally becomes practical. Virtual threads are cheap to create and safe to block, so you can express concurrent logic using straightforward, synchronous-looking code—without thread pools or callbacks. Note that the structured concurrency API (StructuredTaskScope) is still a preview feature, so the examples below require --enable-preview.
Example
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
var f1 = scope.fork(() -> fetchUser(id));
var f2 = scope.fork(() -> fetchOrders(id));
scope.join();
scope.throwIfFailed();
return new UserData(f1.get(), f2.get());
}
All tasks run concurrently, but the structure remains clear:
the parent waits for all children,
failures propagate correctly,
and no threads leak beyond the scope.
In short: virtual threads provide the scalability; structured concurrency provides the clarity. Together they make concurrent Java code simple, safe, and predictable.
5. Issues with Virtual Threads Before Java 25
5.1. The Pinning Problem
The most significant issue with virtual threads before Java 25 was “pinning” – situations where a virtual thread could not unmount from its carrier thread when blocking, defeating the purpose of virtual threads.
Pinning occurred in two main scenarios:
5.1.1. Synchronized Blocks
public class PinningExample {
private final Object lock = new Object();
public void problematicMethod() {
synchronized (lock) { // PINNING OCCURS HERE
try {
// This sleep pins the carrier thread
Thread.sleep(1000);
// I/O operations also pin
String data = blockingDatabaseCall();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
}
What happens during pinning:
Before Pinning:
┌─────────────┐
│Virtual │
│Thread A │
└─────┬───────┘
│ Mounted
↓
┌─────────────┐
│Carrier │
│Thread 1 │
└─────────────┘
During Synchronized Block (Pinned):
┌─────────────┐
│Virtual │
│Thread A │ ← Cannot unmount due to synchronized
│(BLOCKED) │
└─────┬───────┘
│ PINNED
↓
┌─────────────┐
│Carrier │ ← Wasted, cannot be used by other
│Thread 1 │ virtual threads
│(BLOCKED) │
└─────────────┘
Other Virtual Threads Queue Up:
[VThread B] [VThread C] [VThread D] ...
↓
Waiting for available carrier threads
5.1.2. Native Methods and Foreign Functions
public class NativePinningExample {
public void callNativeCode() {
// JNI calls pin the virtual thread
nativeMethod(); // PINNING
}
private native void nativeMethod();
public void foreignFunctionCall() {
// Foreign function calls (Project Panama) also pin
try (Arena arena = Arena.ofConfined()) {
MemorySegment segment = arena.allocate(100);
// Operations here may pin
}
}
}
5.2. Monitoring Pinning Events
Before Java 25, you could detect pinning with JVM flags and Flight Recorder events.
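For example (the exact reporting has varied a little across JDK releases):

// Print a full stack trace whenever a virtual thread pins its carrier thread
java -Djdk.tracePinnedThreads=full YourApplication

// Or capture jdk.VirtualThreadPinned events with Flight Recorder and inspect them in JDK Mission Control
java -XX:StartFlightRecording=filename=pinned.jfr YourApplication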
To work around pinning before Java 25, libraries and applications had to refactor synchronized code to use explicit locks:
// Pre-Java 25: Had to refactor to avoid pinning
public class PreJava25Approach {
// Changed from Object to ReentrantLock
private final ReentrantLock lock = new ReentrantLock();
public void doWork() {
lock.lock(); // More verbose
try {
blockingOperation();
} finally {
lock.unlock();
}
}
}
// Java 25+: Can keep existing synchronized code
public class Java25Approach {
private final Object lock = new Object();
public synchronized void doWork() { // Simple, no pinning
blockingOperation();
}
}
6.5. Remaining Pinning Scenarios
Java 25 removes most cases where virtual threads could become pinned, but a few situations can still prevent a virtual thread from unmounting from its carrier thread:
1. Blocking Native Calls (JNI)
If a virtual thread enters a JNI method that blocks, the JVM cannot safely suspend it, so the carrier thread remains pinned until the native call returns.
2. Synchronized Blocks Leading Into Native Work
Although Java-level synchronization no longer pins, a synchronized section that transitions into a blocking native operation can still force the carrier thread to stay attached.
3. Low-Level APIs Requiring Thread Affinity
Code using Unsafe, custom locks, or mechanisms that assume a fixed OS thread may require pinning to maintain correctness.
6.6. Migration Benefits
Existing codebases automatically benefit from Java 25:
// Legacy code using synchronized (common in older libraries)
public class LegacyService {
private final Map<String, Data> cache = new HashMap<>();
public synchronized Data getData(String key) {
if (!cache.containsKey(key)) {
// This would pin in Java 21-24
// No pinning in Java 25!
Data data = expensiveDatabaseCall(key);
cache.put(key, data);
}
return cache.get(key);
}
private Data expensiveDatabaseCall(String key) {
// Blocking I/O
return new Data();
}
record Data() {}
}
7. Understanding ForkJoinPool and Virtual Thread Scheduling
Virtual threads behave as if each one runs independently, but they do not execute directly on the CPU. Instead, the JVM schedules them onto a small set of real OS threads known as carrier threads. These carrier threads are managed by the ForkJoinPool, which serves as the internal scheduler that runs, pauses, and resumes virtual threads.
This scheduling model allows Java to scale to massive levels of concurrency without overwhelming the operating system.
7.1 What the ForkJoinPool Is
The ForkJoinPool is a high-performance thread pool built around a small number of long-lived worker threads. It was originally designed for parallel computations but is also ideal for running virtual threads because of its extremely efficient scheduling behaviour.
Each worker thread maintains its own task queue, allowing most operations to happen without contention. The pool is designed to keep all CPU cores busy with minimal overhead.
7.2 The Work-Stealing Algorithm
A defining feature of the ForkJoinPool is its work-stealing algorithm. Each worker thread primarily works from its own queue, but when it becomes idle, it doesn’t wait—it looks for work in other workers’ queues.
In other words:
Active workers process their own tasks.
Idle workers “steal” tasks from other queues.
Stealing avoids bottlenecks and keeps all CPU cores busy.
Tasks spread dynamically across the pool, improving throughput.
This decentralized approach avoids the cost of a single shared queue and ensures that no CPU thread sits idle while others still have work.
Work-stealing is one of the main reasons the ForkJoinPool can handle huge numbers of virtual threads efficiently.
7.3 Why Virtual Threads Use the ForkJoinPool
Virtual threads frequently block during operations like I/O, sleeping, or locking. When a virtual thread blocks, the JVM can save its execution state and immediately free the carrier thread.
To make this efficient, Java needs a scheduler that can:
quickly reassign work to available carrier threads
keep CPUs fully utilized
handle thousands or millions of short-lived tasks
pick up paused virtual threads instantly when they resume
The ForkJoinPool, with its lightweight scheduling and work-stealing algorithm, suited these needs perfectly.
7.4 How Virtual Thread Scheduling Works
The scheduling process works as follows:
A virtual thread becomes runnable.
The ForkJoinPool assigns it to an available carrier thread.
The virtual thread executes until it blocks.
The JVM captures its state and unmounts it, freeing the carrier thread.
When the blocking operation completes, the virtual thread is placed back into the pool’s queues.
Any available carrier thread—regardless of which one ran it earlier—can resume it.
Because virtual threads run only when actively computing, and unmount the moment they block, the ForkJoinPool keeps the system efficient and responsive.
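The effect is easy to observe. The sketch below prints the current thread before and after a blocking call; the output typically shows the same virtual thread resumed on a different ForkJoinPool worker. (If you ever need to size the scheduler explicitly, the jdk.virtualThreadScheduler.parallelism system property controls the number of carrier threads, though the default rarely needs changing.)

import java.util.concurrent.Executors;

public class CarrierThreadDemo {
    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 3; i++) {
                executor.submit(() -> {
                    // Typically prints something like:
                    // VirtualThread[#24]/runnable@ForkJoinPool-1-worker-2
                    System.out.println("before sleep: " + Thread.currentThread());
                    try {
                        Thread.sleep(100);   // blocking call: the virtual thread unmounts
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    // May resume on a different carrier (worker) thread
                    System.out.println("after sleep:  " + Thread.currentThread());
                });
            }
        }   // close() waits for the tasks to finish
    }
}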
7.5 Why This Design Scales
This architecture scales exceptionally well:
Few OS threads handle many virtual threads.
Blocking is cheap, because it releases carrier threads instantly.
Work-stealing ensures every CPU is busy and load-balanced.
Context switching is lightweight compared to OS thread switching.
Developers write simple blocking code, without worrying about thread pool exhaustion.
It gives Java the scalability of an asynchronous runtime with the readability of synchronous code.
7.6 Misconceptions About the ForkJoinPool
Although virtual threads rely on a ForkJoinPool internally, they do not interfere with:
parallel streams,
custom ForkJoinPools created by the application,
or other thread pools.
The virtual-thread scheduler is isolated, and it normally requires no configuration or tuning.
The ForkJoinPool, powered by its work-stealing algorithm, provides the small number of OS threads and the efficient scheduling needed to run them at scale. Together, they allow Java to deliver enormous concurrency without the complexity or overhead of traditional threading models.
8. Virtual Threads vs. Reactive Programming
8.1. Code Complexity Comparison
// Scenario: Fetch user data, enrich with profile, save to database
// Reactive approach (Spring WebFlux)
public class ReactiveUserService {
public Mono<User> processUser(String userId) {
return userRepository.findById(userId)
.flatMap(user ->
profileService.getProfile(user.getProfileId())
.map(profile -> user.withProfile(profile))
)
.flatMap(user ->
enrichmentService.enrichData(user)
)
.flatMap(user ->
userRepository.save(user)
)
.doOnError(error ->
log.error("Error processing user", error)
)
.timeout(Duration.ofSeconds(5))
.retry(3);
}
}
// Virtual thread approach (Spring Boot with Virtual Threads)
public class VirtualThreadUserService {
public User processUser(String userId) {
try {
// Simple, sequential code that scales
User user = userRepository.findById(userId);
Profile profile = profileService.getProfile(user.getProfileId());
user = user.withProfile(profile);
user = enrichmentService.enrichData(user);
return userRepository.save(user);
} catch (Exception e) {
log.error("Error processing user", e);
throw e;
}
}
}
8.2. Error Handling Comparison
// Reactive error handling
public Mono<Result> reactiveProcessing() {
return fetchData()
.flatMap(data -> validate(data))
.flatMap(data -> process(data))
.onErrorResume(ValidationException.class, e ->
Mono.just(Result.validationFailed(e)))
.onErrorResume(ProcessingException.class, e ->
Mono.just(Result.processingFailed(e)))
.onErrorResume(e ->
Mono.just(Result.unknownError(e)));
}
// Virtual thread error handling
public Result virtualThreadProcessing() {
try {
Data data = fetchData();
validate(data);
return process(data);
} catch (ValidationException e) {
return Result.validationFailed(e);
} catch (ProcessingException e) {
return Result.processingFailed(e);
} catch (Exception e) {
return Result.unknownError(e);
}
}
8.3. When to Use Each Approach
Use Virtual Threads When:
You want simple, readable code
Your team is familiar with imperative programming
You need easy debugging with clear stack traces
You’re working with blocking APIs
You want to migrate existing code with minimal changes
Consider Reactive When:
You need backpressure handling
You’re building streaming data pipelines
You need fine grained control over execution
Your entire stack is already reactive
9. Advanced Virtual Thread Patterns
9.1. Fan Out / Fan In Pattern
public class FanOutFanInPattern {
public CompletedReport generateReport(List<String> dataSourceIds) throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
// Fan out: Submit tasks for each data source
List<Subtask<DataChunk>> tasks = dataSourceIds.stream()
.map(id -> scope.fork(() -> fetchFromDataSource(id)))
.toList();
// Wait for all to complete
scope.join();
scope.throwIfFailed();
// Fan in: Combine results
List<DataChunk> allData = tasks.stream()
.map(Subtask::get)
.toList();
return aggregateReport(allData);
}
}
private DataChunk fetchFromDataSource(String id) throws InterruptedException {
Thread.sleep(100); // Simulate I/O
return new DataChunk(id, "Data from " + id);
}
private CompletedReport aggregateReport(List<DataChunk> chunks) {
return new CompletedReport(chunks);
}
record DataChunk(String sourceId, String data) {}
record CompletedReport(List<DataChunk> chunks) {}
}
9.2. Rate Limited Processing
public class RateLimitedProcessor {
private final Semaphore rateLimiter;
private final ExecutorService executor;
public RateLimitedProcessor(int maxConcurrent) {
this.rateLimiter = new Semaphore(maxConcurrent);
this.executor = Executors.newVirtualThreadPerTaskExecutor();
}
public void processItems(List<Item> items) throws InterruptedException {
CountDownLatch latch = new CountDownLatch(items.size());
for (Item item : items) {
executor.submit(() -> {
try {
rateLimiter.acquire();
try {
processItem(item);
} finally {
rateLimiter.release();
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} finally {
latch.countDown();
}
});
}
latch.await();
}
private void processItem(Item item) throws InterruptedException {
Thread.sleep(50); // Simulate processing
System.out.println("Processed: " + item.id());
}
public void shutdown() {
executor.close();
}
record Item(String id) {}
public static void main(String[] args) throws InterruptedException {
RateLimitedProcessor processor = new RateLimitedProcessor(10);
List<Item> items = IntStream.range(0, 100)
.mapToObj(i -> new Item("item-" + i))
.toList();
long start = System.currentTimeMillis();
processor.processItems(items);
long duration = System.currentTimeMillis() - start;
System.out.println("Processed " + items.size() +
" items in " + duration + "ms");
processor.shutdown();
}
}
9.3. Timeout Pattern
public class TimeoutPattern {
public <T> T executeWithTimeout(Callable<T> task, Duration timeout)
throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
Subtask<T> subtask = scope.fork(task);
// Join with timeout
scope.joinUntil(Instant.now().plus(timeout));
if (subtask.state() == Subtask.State.SUCCESS) {
return subtask.get();
} else {
throw new TimeoutException("Task did not complete within " + timeout);
}
}
}
public static void main(String[] args) {
TimeoutPattern pattern = new TimeoutPattern();
try {
String result = pattern.executeWithTimeout(
() -> {
Thread.sleep(5000);
return "Completed";
},
Duration.ofSeconds(2)
);
System.out.println("Result: " + result);
} catch (TimeoutException e) {
System.out.println("Task timed out!");
} catch (Exception e) {
e.printStackTrace();
}
}
}
9.4. Racing Tasks Pattern
public class RacingTasksPattern {
public <T> T race(List<Callable<T>> tasks) throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnSuccess<T>()) {
// Submit all tasks
for (Callable<T> task : tasks) {
scope.fork(task);
}
// Wait for first success
scope.join();
// Return the first result
return scope.result();
}
}
public static void main(String[] args) throws Exception {
RacingTasksPattern pattern = new RacingTasksPattern();
List<Callable<String>> tasks = List.of(
() -> {
Thread.sleep(1000);
return "Server 1 response";
},
() -> {
Thread.sleep(500);
return "Server 2 response";
},
() -> {
Thread.sleep(2000);
return "Server 3 response";
}
);
long start = System.currentTimeMillis();
String result = pattern.race(tasks);
long duration = System.currentTimeMillis() - start;
System.out.println("Winner: " + result);
System.out.println("Time: " + duration + "ms");
// Output: Winner: Server 2 response, Time: ~500ms
}
}
10. Best Practices and Gotchas
10.1. ThreadLocal Considerations
Virtual threads and ThreadLocal can lead to memory issues:
public class ThreadLocalIssues {
// PROBLEM: ThreadLocal with virtual threads
private static final ThreadLocal<ExpensiveResource> resource =
ThreadLocal.withInitial(ExpensiveResource::new);
public void problematicUsage() {
// With millions of virtual threads, millions of instances!
ExpensiveResource r = resource.get();
r.doWork();
}
// SOLUTION 1: Use scoped values (Java 21+)
private static final ScopedValue<ExpensiveResource> scopedResource =
ScopedValue.newInstance();
public void betterUsage() {
ExpensiveResource r = new ExpensiveResource();
ScopedValue.where(scopedResource, r).run(() -> {
ExpensiveResource scoped = scopedResource.get();
scoped.doWork();
});
}
// SOLUTION 2: Pass as parameters
public void bestUsage(ExpensiveResource resource) {
resource.doWork();
}
static class ExpensiveResource {
private final byte[] data = new byte[1024 * 1024]; // 1MB
void doWork() {
// Work with resource
}
}
}
10.2. Don’t Block the Carrier Thread Pool
public class CarrierThreadPoolGotchas {
// BAD: CPU intensive work in virtual threads
public void cpuIntensiveWork() {
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
for (int i = 0; i < 1000; i++) {
executor.submit(() -> {
// This blocks a carrier thread with CPU work
computePrimes(1_000_000);
});
}
}
}
// GOOD: Use platform thread pool for CPU work
public void properCpuWork() {
try (ExecutorService executor = Executors.newFixedThreadPool(
Runtime.getRuntime().availableProcessors())) {
for (int i = 0; i < 1000; i++) {
executor.submit(() -> {
computePrimes(1_000_000);
});
}
}
}
// VIRTUAL THREADS: Best for I/O bound work
public void ioWork() {
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
for (int i = 0; i < 1_000_000; i++) {
executor.submit(() -> {
try {
// I/O operations: perfect for virtual threads
String data = fetchFromDatabase();
sendToAPI(data);
} catch (Exception e) {
e.printStackTrace();
}
});
}
}
}
private void computePrimes(int limit) {
// CPU intensive calculation
for (int i = 2; i < limit; i++) {
boolean isPrime = true;
for (int j = 2; j <= Math.sqrt(i); j++) {
if (i % j == 0) {
isPrime = false;
break;
}
}
}
}
private String fetchFromDatabase() {
return "data";
}
private void sendToAPI(String data) {
// API call
}
}
10.3. Monitoring and Observability
public class VirtualThreadMonitoring {
public static void main(String[] args) throws Exception {
// Enable virtual thread events
System.setProperty("jdk.tracePinnedThreads", "full");
// Get thread metrics
ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
// Submit many tasks
List<Future<?>> futures = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
futures.add(executor.submit(() -> {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}));
}
// Monitor while tasks execute
Thread.sleep(50);
System.out.println("Thread count: " + threadBean.getThreadCount());
System.out.println("Peak threads: " + threadBean.getPeakThreadCount());
// Wait for completion
for (Future<?> future : futures) {
future.get();
}
}
System.out.println("Final thread count: " + threadBean.getThreadCount());
}
}
10.4. Structured Concurrency Best Practices
public class StructuredConcurrencyBestPractices {
// GOOD: Properly structured with clear lifecycle
public Result processWithStructure() throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
Subtask<Data> dataTask = scope.fork(this::fetchData);
Subtask<Config> configTask = scope.fork(this::fetchConfig);
scope.join();
scope.throwIfFailed();
return new Result(dataTask.get(), configTask.get());
} // Scope ensures all tasks complete or are cancelled
}
// BAD: Unstructured concurrency (avoid)
public Result processWithoutStructure() {
CompletableFuture<Data> dataFuture =
CompletableFuture.supplyAsync(this::fetchData);
CompletableFuture<Config> configFuture =
CompletableFuture.supplyAsync(this::fetchConfig);
// No clear lifecycle, potential resource leaks
return new Result(
dataFuture.join(),
configFuture.join()
);
}
private Data fetchData() {
return new Data();
}
private Config fetchConfig() {
return new Config();
}
record Data() {}
record Config() {}
record Result(Data data, Config config) {}
}
11. Real World Use Cases
11.1. Web Server with Virtual Threads
// Spring Boot 3.2+ with Virtual Threads
@SpringBootApplication
public class VirtualThreadWebApp {
public static void main(String[] args) {
SpringApplication.run(VirtualThreadWebApp.class, args);
}
@Bean
public TomcatProtocolHandlerCustomizer<?> protocolHandlerVirtualThreadExecutorCustomizer() {
return protocolHandler -> {
protocolHandler.setExecutor(Executors.newVirtualThreadPerTaskExecutor());
};
}
}
@RestController
@RequestMapping("/api")
class UserController {
@Autowired
private UserService userService;
@GetMapping("/users/{id}")
public ResponseEntity<User> getUser(@PathVariable String id) {
// This runs on a virtual thread
// Blocking calls are fine!
User user = userService.fetchUser(id);
return ResponseEntity.ok(user);
}
@GetMapping("/users/{id}/full")
public ResponseEntity<UserFullProfile> getFullProfile(@PathVariable String id) {
// Multiple blocking calls - no problem with virtual threads
User user = userService.fetchUser(id);
List<Order> orders = userService.fetchOrders(id);
List<Review> reviews = userService.fetchReviews(id);
return ResponseEntity.ok(
new UserFullProfile(user, orders, reviews)
);
}
record User(String id, String name) {}
record Order(String id) {}
record Review(String id) {}
record UserFullProfile(User user, List<Order> orders, List<Review> reviews) {}
}
11.2. Batch Processing System
public class BatchProcessor {
private final ExecutorService executor =
Executors.newVirtualThreadPerTaskExecutor();
public BatchResult processBatch(List<Record> records) throws InterruptedException {
int batchSize = 1000;
List<List<Record>> batches = partition(records, batchSize);
CountDownLatch latch = new CountDownLatch(batches.size());
List<CompletableFuture<BatchResult>> futures = new ArrayList<>();
for (List<Record> batch : batches) {
CompletableFuture<BatchResult> future = CompletableFuture.supplyAsync(
() -> {
try {
return processSingleBatch(batch);
} finally {
latch.countDown();
}
},
executor
);
futures.add(future);
}
latch.await();
// Combine results
return futures.stream()
.map(CompletableFuture::join)
.reduce(BatchResult.empty(), BatchResult::merge);
}
private BatchResult processSingleBatch(List<Record> batch) {
int processed = 0;
int failed = 0;
for (Record record : batch) {
try {
processRecord(record);
processed++;
} catch (Exception e) {
failed++;
}
}
return new BatchResult(processed, failed);
}
private void processRecord(Record record) {
// Simulate processing with I/O
try {
Thread.sleep(10);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
private <T> List<List<T>> partition(List<T> list, int size) {
List<List<T>> partitions = new ArrayList<>();
for (int i = 0; i < list.size(); i += size) {
partitions.add(list.subList(i, Math.min(i + size, list.size())));
}
return partitions;
}
public void shutdown() {
executor.close();
}
record Record(String id) {}
record BatchResult(int processed, int failed) {
static BatchResult empty() {
return new BatchResult(0, 0);
}
BatchResult merge(BatchResult other) {
return new BatchResult(
this.processed + other.processed,
this.failed + other.failed
);
}
}
}
11.3. Microservice Communication
public class MicroserviceOrchestrator {
private final ExecutorService executor =
Executors.newVirtualThreadPerTaskExecutor();
private final HttpClient httpClient = HttpClient.newHttpClient();
public OrderResponse processOrder(OrderRequest request) throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
// Call multiple microservices in parallel
Subtask<Customer> customerTask = scope.fork(
() -> fetchCustomer(request.customerId())
);
Subtask<Inventory> inventoryTask = scope.fork(
() -> checkInventory(request.productId(), request.quantity())
);
Subtask<PaymentResult> paymentTask = scope.fork(
() -> processPayment(request.customerId(), request.amount())
);
Subtask<ShippingQuote> shippingTask = scope.fork(
() -> getShippingQuote(request.address())
);
// Wait for all services to respond
scope.join();
scope.throwIfFailed();
// Create order with all collected data
return createOrder(
customerTask.get(),
inventoryTask.get(),
paymentTask.get(),
shippingTask.get()
);
}
}
private Customer fetchCustomer(String customerId) {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create("http://customer-service/api/customers/" + customerId))
.build();
try {
HttpResponse<String> response =
httpClient.send(request, HttpResponse.BodyHandlers.ofString());
return parseCustomer(response.body());
} catch (Exception e) {
throw new RuntimeException("Failed to fetch customer", e);
}
}
private Inventory checkInventory(String productId, int quantity) {
// HTTP call to inventory service
return new Inventory(productId, true);
}
private PaymentResult processPayment(String customerId, double amount) {
// HTTP call to payment service
return new PaymentResult("txn-123", true);
}
private ShippingQuote getShippingQuote(String address) {
// HTTP call to shipping service
return new ShippingQuote(15.99);
}
private Customer parseCustomer(String json) {
return new Customer("cust-1", "John Doe");
}
private OrderResponse createOrder(Customer customer, Inventory inventory,
PaymentResult payment, ShippingQuote shipping) {
return new OrderResponse("order-123", "CONFIRMED");
}
record OrderRequest(String customerId, String productId, int quantity,
double amount, String address) {}
record Customer(String id, String name) {}
record Inventory(String productId, boolean available) {}
record PaymentResult(String transactionId, boolean success) {}
record ShippingQuote(double cost) {}
record OrderResponse(String orderId, String status) {}
}
When migrating an existing service to virtual threads, a short checklist helps:
Refactor Synchronized Blocks: In Java 21-24, replace with ReentrantLock; in Java 25+, keep as is
Test Under Load: Ensure no regressions
Monitor Pinning: Use JVM flags to detect remaining pinning issues
14. Conclusion
Virtual threads represent a fundamental shift in Java’s concurrency model. They bring the simplicity of synchronous programming to highly concurrent applications, enabling millions of concurrent operations without the resource constraints of platform threads.
Key Takeaways:
Virtual threads are cheap: Create millions without memory concerns
Blocking is fine: The JVM handles mount/unmount efficiently
Platform threads still have a role: Reserve them for CPU-bound work or when you need precise control over thread scheduling
Virtual threads, combined with structured concurrency, provide Java developers with powerful tools to build scalable, maintainable concurrent applications without the complexity of reactive programming. With Java 25’s improvements eliminating the major pinning issues, virtual threads are now production ready for virtually any use case.
Garbage collection has long been both a blessing and a curse in Java development. While automatic memory management frees developers from manual allocation and deallocation, traditional garbage collectors introduced unpredictable stop the world pauses that could severely impact application responsiveness. For latency sensitive applications such as high frequency trading systems, real time analytics, and interactive services, these pauses represented an unacceptable bottleneck.
Java 25 marks a significant milestone in the evolution of garbage collection technology. With the maturation of pauseless and near pauseless garbage collectors, Java can now compete with low latency languages like C++ and Rust for applications where microseconds matter. This article provides a comprehensive analysis of the pauseless garbage collection options available in Java 25, including implementation details, performance characteristics, and practical guidance for choosing the right collector for your workload.
2. Understanding Pauseless Garbage Collection
2.1 The Problem with Traditional Collectors
Traditional garbage collectors like Parallel GC and even the sophisticated G1 collector require stop the world pauses for certain operations. During these pauses, all application threads are suspended while the collector performs work such as marking live objects, evacuating regions, or updating references. The duration of these pauses typically scales with heap size and the complexity of the object graph, making them problematic for:
Large heap applications (tens to hundreds of gigabytes)
Real time systems with strict latency requirements
High throughput services where tail latency affects user experience
Systems requiring consistent 99.99th percentile response times
2.2 Concurrent Collection Principles
Pauseless garbage collectors minimize or eliminate stop the world pauses by performing most of their work concurrently with application threads. This is achieved through several key techniques:
Read and Write Barriers: These are lightweight checks inserted into the application code that ensure memory consistency between concurrent GC and application threads. Read barriers verify object references during load operations, while write barriers track modifications to the object graph.
Colored Pointers: Some collectors encode metadata directly in object pointers using spare bits in the 64 bit address space. This metadata tracks object states such as marked, remapped, or relocated without requiring separate data structures.
Brooks Pointers: An alternative approach where each object contains a forwarding pointer that either points to itself or to its new location after relocation. This enables concurrent compaction without long pauses.
Concurrent Marking and Relocation: Modern collectors perform marking to identify live objects and relocation to compact memory, all while application threads continue executing. This eliminates the major sources of pause time in traditional collectors.
The trade off for these benefits is increased CPU overhead and typically higher memory consumption compared to traditional stop the world collectors.
3. Z Garbage Collector (ZGC)
3.1 Overview and Architecture
ZGC is a scalable, low latency garbage collector introduced in Java 11 and made production ready in Java 15. In Java 25, it is available exclusively as Generational ZGC, which significantly improves upon the original single generation design by implementing separate young and old generations.
Key characteristics include:
Pause times consistently under 1 millisecond (submillisecond)
Pause times independent of heap size (8MB to 16TB)
Pause times independent of live set or root set size
Concurrent marking, relocation, and reference processing
Region based heap layout with dynamic region sizing
NUMA aware memory allocation
3.2 Technical Implementation
ZGC uses colored pointers as its core mechanism. In the 64 bit pointer layout, ZGC reserves bits for metadata:
16 bits: Unused high-order bits
44 bits: Address space (supporting heaps up to 16TB)
4 bits: Metadata including Marked0, Marked1, Remapped, and Finalizable bits
This encoding allows ZGC to track object states without separate metadata structures. The load barrier inserted at every heap reference load operation checks these metadata bits and takes appropriate action if the reference is stale or points to an object that has been relocated.
The ZGC collection cycle consists of several phases:
Pause Mark Start: Brief pause to set up marking roots (typically less than 1ms)
Concurrent Mark: Traverse the object graph to identify live objects
Pause Mark End: Brief pause to finalize marking
Concurrent Process Non-Strong References: Handle weak, soft, and phantom references
Pause Relocate Start: Brief pause to kick off relocation
Concurrent Relocation: Move live objects to new locations to compact memory
Concurrent Remap: Update references to relocated objects
All phases except the three brief pauses run concurrently with application threads.
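These phases are easy to observe with unified JVM logging; the exact tags and output format differ slightly between JDK versions, but something along these lines works:

// Log GC activity, including pause and concurrent phase durations, to a file
java -XX:+UseZGC -Xlog:gc*:file=gc.log:time,uptime YourApplication

// Quick interactive view while experimenting
java -XX:+UseZGC -Xlog:gc YourApplication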
3.3 Generational ZGC in Java 25
Java 25 is the first LTS release where Generational ZGC is the default and only implementation of ZGC. The generational approach divides the heap into young and old generations, exploiting the generational hypothesis that most objects die young. This provides several benefits:
Reduced marking overhead by focusing young collections on recently allocated objects
Improved throughput by avoiding full heap marking for every collection
Better cache locality and memory bandwidth utilization
Lower CPU overhead compared to single generation ZGC
Generational ZGC maintains the same submillisecond pause time guarantees while significantly improving throughput, making it suitable for a broader range of applications.
3.4 Configuration and Tuning
Basic Enablement
// Enable ZGC (in Java 25, -XX:+UseZGC always selects Generational ZGC)
java -XX:+UseZGC -Xmx16g -Xms16g YourApplication
// Note: G1 remains the default collector, so ZGC must be selected explicitly
// No further flags are needed for a basic setup
Heap Size Configuration
The most critical tuning parameter for ZGC is heap size:
// Set maximum and minimum heap size
java -XX:+UseZGC -Xmx32g -Xms32g YourApplication
// Set soft maximum heap size (ZGC will try to stay below this)
java -XX:+UseZGC -Xmx64g -XX:SoftMaxHeapSize=48g YourApplication
ZGC requires sufficient headroom in the heap to accommodate allocations while concurrent collection is running. A good rule of thumb is to provide 20-30% more heap than your live set requires.
Concurrent GC Threads
Starting from JDK 17, ZGC dynamically scales concurrent GC threads, but you can override:
// Set number of concurrent GC threads
java -XX:+UseZGC -XX:ConcGCThreads=8 YourApplication
// Set number of parallel GC threads for STW phases
java -XX:+UseZGC -XX:ParallelGCThreads=16 YourApplication
3.5 Performance Characteristics
Latency: ZGC consistently achieves pause times under 1 millisecond regardless of heap size. Studies show pause times typically range from 0.1ms to 0.5ms even on multi terabyte heaps.
Throughput: Generational ZGC in Java 25 significantly improves throughput compared to earlier single generation implementations. Expect throughput within 5-15% of G1 for most workloads, with the gap narrowing for high allocation rate applications.
Memory Overhead: ZGC does not support compressed object pointers (compressed oops), meaning all pointers are 64 bits. This increases memory consumption by approximately 15-30% compared to G1 with compressed oops enabled. Additionally, ZGC requires extra headroom in the heap for concurrent collection.
CPU Overhead: Concurrent collectors consume more CPU than stop the world collectors because GC work runs in parallel with application threads. ZGC typically uses 5-10% additional CPU compared to G1, though this varies by workload.
3.6 When to Use ZGC
ZGC is ideal for:
Applications requiring consistent sub 10ms pause times (ZGC provides submillisecond)
Large heap applications (32GB and above)
Systems where tail latency directly impacts business metrics
Real time or near real time processing systems
High frequency trading platforms
Interactive applications requiring smooth user experience
Microservices with strict SLA requirements
Avoid ZGC for:
Memory constrained environments (due to higher memory overhead)
Small heaps (under 4GB) where G1 may be more efficient
Batch processing jobs where throughput is paramount and latency does not matter
Applications already meeting latency requirements with G1
4. Shenandoah GC
4.1 Overview and Architecture
Shenandoah is a low latency garbage collector developed by Red Hat and integrated into OpenJDK starting with Java 12. Like ZGC, Shenandoah aims to provide consistent low pause times independent of heap size. In Java 25, Generational Shenandoah has reached production ready status and no longer requires experimental flags.
Key characteristics include:
Pause times typically 1-10 milliseconds, independent of heap size
Concurrent marking, evacuation, and reference processing
Uses Brooks pointers for concurrent compaction
Region based heap management
Support for both generational and non generational modes
Works well with heap sizes from hundreds of megabytes to hundreds of gigabytes
4.2 Technical Implementation
Unlike ZGC’s colored pointers, Shenandoah uses Brooks pointers (also called forwarding pointers or indirection pointers). Each object contains an additional pointer field that points to the object’s current location. When an object is relocated during compaction:
The object is copied to its new location
The Brooks pointer in the old location is updated to point to the new location
Application threads accessing the old location follow the forwarding pointer
This mechanism enables concurrent compaction because the GC can update the Brooks pointer atomically, and application threads will automatically see the new location through the indirection.
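As a rough illustration, the indirection can be modeled in a few lines of Java. This is a toy sketch only; the real barrier is machine code emitted by the JIT compiler, and the class and field names here are invented for the example.
final class BrooksCell {
    // Forwarding pointer: points at this object itself until the GC relocates it,
    // after which it points at the new copy.
    volatile BrooksCell forwardee = this;
    int value;

    // "Read barrier": always dereference through the forwarding pointer
    // before touching the object's fields.
    static int read(BrooksCell ref) {
        return ref.forwardee.value;
    }
}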
A typical Shenandoah collection cycle proceeds through the following phases:
Init Mark: Brief STW pause to prepare for concurrent marking
Concurrent Mark: Trace the live object graph while the application runs
Final Mark: Brief STW pause to finish marking and choose the collection set
Concurrent Evacuation: Copy live objects out of the collection set
Init Update References: Brief STW pause to prepare for reference updating
Concurrent Update References: Update references to point at relocated objects
Final Update References: Brief STW pause to finish reference updates
Concurrent Cleanup: Reclaim evacuated regions
4.3 Generational Shenandoah in Java 25
Generational Shenandoah divides the heap into young and old generations, similar to Generational ZGC. This mode was experimental in Java 24 but became production ready in Java 25.
Benefits of generational mode:
Reduced marking overhead by focusing on young generation for most collections
Lower GC overhead due to exploiting the generational hypothesis
Improved throughput while maintaining low pause times
Better handling of high allocation rate workloads
Generational Shenandoah is now the default when enabling Shenandoah GC.
4.4 Configuration and Tuning
Basic Enablement
# Enable Shenandoah with generational mode (default in Java 25)
java -XX:+UseShenandoahGC YourApplication
# Explicit generational mode (default, not required)
java -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational YourApplication
# Use non-generational mode (legacy)
java -XX:+UseShenandoahGC -XX:ShenandoahGCMode=satb YourApplication
Heap Size Configuration
# Set heap size with fixed min and max for predictable performance
java -XX:+UseShenandoahGC -Xmx16g -Xms16g YourApplication
# Allow heap to resize (may cause some latency variability)
java -XX:+UseShenandoahGC -Xmx32g -Xms8g YourApplication
GC Thread Configuration
# Set concurrent GC threads (default is calculated from CPU count)
java -XX:+UseShenandoahGC -XX:ConcGCThreads=4 YourApplication
# Set parallel GC threads for STW phases
java -XX:+UseShenandoahGC -XX:ParallelGCThreads=8 YourApplication
Heuristics Selection
Shenandoah offers different heuristics for collection triggering:
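A minimal sketch of the selection flag (heuristic names as documented for OpenJDK Shenandoah; adaptive is the default in current builds):
# Adaptive (default): starts cycles based on allocation rate and free headroom
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=adaptive YourApplication
# Static: starts cycles at a fixed heap occupancy threshold
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=static YourApplication
# Compact: runs continuous cycles to minimize footprint, trading throughput
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=compact YourApplication
# Aggressive: collects continuously; mainly useful for testing barrier overhead
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=aggressive YourApplication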
4.5 Performance Characteristics
Latency: Shenandoah typically achieves pause times in the 1-10ms range, with most pauses under 5ms. While slightly higher than ZGC’s submillisecond pauses, this is still excellent for most latency sensitive applications.
Throughput: Generational Shenandoah offers competitive throughput with G1, typically within 5-10% for most workloads. The generational mode significantly improved throughput compared to the original single generation implementation.
Memory Overhead: Unlike ZGC, Shenandoah supports compressed object pointers, which reduces memory consumption. However, the Brooks pointer adds an extra word to each object. Overall memory overhead is typically 10-20% compared to G1.
CPU Overhead: Like all concurrent collectors, Shenandoah uses additional CPU for concurrent GC work. Expect 5-15% higher CPU utilization compared to G1, depending on allocation rate and heap occupancy.
4.6 When to Use Shenandoah
Shenandoah is ideal for:
Applications requiring consistent pause times under 10ms
Medium to large heaps (4GB to 256GB)
Cloud native microservices with moderate latency requirements
Applications with high allocation rates
Systems where compressed oops are beneficial (memory constrained)
OpenJDK and Red Hat environments where Shenandoah is well supported
Avoid Shenandoah for:
Ultra low latency requirements (under 1ms) where ZGC is better
Extremely large heaps (multi terabyte) where ZGC scales better
Batch jobs prioritizing throughput over latency
Small heaps (under 2GB) where G1 may be more efficient
5. C4 Garbage Collector (Azul Zing)
5.1 Overview and Architecture
The Continuously Concurrent Compacting Collector (C4) is a proprietary garbage collector developed by Azul Systems and available exclusively in Azul Platform Prime (formerly Zing). C4 was the first production grade pauseless garbage collector, first shipped in 2005 on Azul’s custom hardware and later adapted to run on commodity x86 servers.
Key characteristics include:
True pauseless operation with pauses consistently under 1ms
No fallback to stop the world compaction under any circumstances
Generational design with concurrent young and old generation collection
Supports heap sizes ranging from small heaps up to 20TB
Uses Loaded Value Barriers (LVB) for concurrent relocation
Proprietary JVM with enhanced performance features
5.2 Technical Implementation
C4’s core innovation is the Loaded Value Barrier (LVB), a sophisticated read barrier mechanism. Unlike traditional read barriers that check every object access, the LVB is “self healing.” When an application thread loads a reference to a relocated object:
The LVB detects the stale reference
The application thread itself fixes the reference to point to the new location
The corrected reference is written back to memory
Future accesses use the corrected reference, avoiding barrier overhead
This self healing property dramatically reduces the ongoing cost of read barriers compared to other concurrent collectors. Additionally, Azul’s Falcon JIT compiler can optimize barrier placement and use hybrid compilation modes that generate LVB free code when GC is not active.
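The self healing idea can be sketched as follows. This is purely illustrative Java, not Azul’s implementation: in C4 the barrier is emitted by the Falcon compiler, and the names used here are invented for the example.
final class LvbSketch {
    static final class Obj {
        Obj forwardee;                       // set by the GC once the object has been relocated
        boolean relocated() { return forwardee != null; }
    }

    // Barrier applied when a reference is loaded from a heap slot.
    static Obj loadRef(Obj[] slots, int i) {
        Obj ref = slots[i];
        if (ref != null && ref.relocated()) {
            ref = ref.forwardee;             // resolve to the object's new location
            slots[i] = ref;                  // self heal: later loads skip the fix up entirely
        }
        return ref;
    }
}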
C4 operates in four main stages:
Mark: Identify live objects concurrently using a guaranteed single pass marking algorithm
Relocate: Move live objects to new locations to compact memory
Remap: Update references to relocated objects
Quick Release: Immediately make freed memory available for allocation
All stages operate concurrently without stop the world pauses. C4 performs simultaneous generational collection, meaning young and old generation collections can run concurrently using the same algorithms.
5.3 Azul Platform Prime Differences
Azul Platform Prime is not just a garbage collector but a complete JVM with several enhancements:
Falcon JIT Compiler: Replaces HotSpot’s C2 compiler with a more aggressive optimizing compiler that produces faster native code. Falcon understands the LVB and can optimize its placement.
ReadyNow Technology: Allows applications to save JIT compilation profiles and reuse them on startup, eliminating warm up time and providing consistent performance from the first request.
Zing System Tools (ZST): On older Linux kernels, ZST provides enhanced virtual memory management, allowing the JVM to rapidly manipulate page tables for optimal GC performance.
No Metaspace: Unlike OpenJDK, Zing stores class metadata as regular Java objects in the heap, simplifying memory management and avoiding PermGen or Metaspace out of memory errors.
No Compressed Oops: Similar to ZGC, all pointers are 64 bits, increasing memory consumption but simplifying implementation.
5.4 Configuration and Tuning
C4 requires minimal tuning because it is designed to be largely self managing. The main parameter is heap size:
# Basic C4 usage (C4 is the only GC in Zing)
java -Xmx32g -Xms32g -jar YourApplication.jar
# Enable ReadyNow for consistent startup performance
java -Xmx32g -Xms32g -XX:ReadyNowLogDir=/path/to/profiles -jar YourApplication.jar
# Configure concurrent GC threads (rarely needed)
java -Xmx32g -XX:ConcGCThreads=8 -jar YourApplication.jar
# Enable GC logging
java -Xmx32g -Xlog:gc*:file=gc.log:time,uptime,level,tags -jar YourApplication.jar
For hybrid mode LVB (reduces barrier overhead when GC is not active):
# Enable hybrid mode with sampling
java -Xmx32g -XX:GPGCLvbCodeVersioningMode=sampling -jar YourApplication.jar
# Enable hybrid mode for all methods (higher compilation overhead)
java -Xmx32g -XX:GPGCLvbCodeVersioningMode=allMethods -jar YourApplication.jar
5.5 Performance Characteristics
Latency: C4 provides true pauseless operation with pause times consistently under 1ms across all heap sizes. Maximum pauses rarely exceed 0.5ms even on multi terabyte heaps. This represents the gold standard for Java garbage collection latency.
Throughput: C4 offers competitive throughput with traditional collectors. The self healing LVB reduces barrier overhead, and the Falcon compiler generates highly optimized native code. Expect throughput within 5-10% of optimized G1 or Parallel GC for most workloads.
Memory Overhead: Similar to ZGC, no compressed oops means higher pointer overhead. Additionally, C4 maintains various concurrent data structures. Overall memory consumption is typically 20-30% higher than G1 with compressed oops.
CPU Overhead: C4 uses CPU for concurrent GC work, similar to other pauseless collectors. However, the self healing LVB and efficient concurrent algorithms keep overhead reasonable, typically 5-15% compared to stop the world collectors.
5.6 When to Use C4
C4 is ideal for:
Ultra low latency requirements (submillisecond) at scale
Large heap applications (100GB+) requiring true pauseless operation
Financial services, trading platforms, and payment processing
Applications where GC tuning complexity must be minimized
Organizations willing to invest in commercial JVM support
Considerations:
Commercial licensing required (no open source option)
Linux only (no Windows or macOS support)
Proprietary JVM means dependency on Azul Systems
Higher cost compared to OpenJDK based solutions
Limited community ecosystem compared to OpenJDK
6. Comparative Analysis
6.1 Architectural Differences
Feature            | ZGC                    | Shenandoah             | C4
-------------------|------------------------|------------------------|----------------------
Pointer Technique  | Colored Pointers       | Brooks Pointers        | Loaded Value Barrier
Compressed Oops    | No                     | Yes                    | No
Generational       | Yes (Java 25)          | Yes (Java 25)          | Yes
Open Source        | Yes                    | Yes                    | No
Platform Support   | Linux, Windows, macOS  | Linux, Windows, macOS  | Linux only
Max Heap Size      | 16TB                   | Limited by system      | 20TB
STW Phases         | 2 brief pauses         | Multiple brief pauses  | Effectively pauseless
6.2 Latency Comparison
Based on published benchmarks and production reports:
ZGC: Consistently achieves 0.1-0.5ms pause times regardless of heap size. Occasional spikes to 1ms under extreme allocation pressure. Pause times truly independent of heap size.
Shenandoah: Typically 1-5ms pause times with occasional spikes to 10ms. Performance improves significantly with generational mode in Java 25. Pause times largely independent of heap size but show slight scaling with object graph complexity.
C4: Sub millisecond pause times with maximum pauses typically under 0.5ms. Most consistent pause time distribution of the three. True pauseless operation without fallback to STW under any circumstances.
Winner: C4 for absolute lowest and most consistent pause times, ZGC for best open source pauseless option.
6.3 Throughput Comparison
Throughput varies significantly by workload characteristics:
High Allocation Rate (4+ GB/s):
C4 and ZGC perform best with generational modes
Shenandoah shows 5-15% lower throughput
G1 struggles with high allocation rates
Moderate Allocation Rate (1-3 GB/s):
All three pauseless collectors within 10% of each other
G1 competitive or slightly better in some cases
Generational modes essential for good throughput
Low Allocation Rate (<1 GB/s):
Throughput differences minimal between collectors
G1 may have slight advantage due to lower overhead
Pauseless collectors provide latency benefits with negligible throughput cost
Large Live Set (70%+ heap occupancy):
ZGC and C4 maintain stable throughput
Shenandoah may show slight degradation
G1 can experience mixed collection pressure
6.4 Memory Consumption Comparison
Memory overhead compared to G1 with compressed oops:
ZGC: +20-30% due to no compressed oops and concurrent data structures. Requires 20-30% heap headroom for concurrent collection. Total memory requirement approximately 1.5x live set.
Shenandoah: +10-20% due to Brooks pointers and concurrent structures. Supports compressed oops which partially offsets overhead. Requires 15-20% heap headroom. Total memory requirement approximately 1.3x live set.
C4: +20-30% similar to ZGC. No compressed oops and various concurrent data structures. Efficient “quick release” mechanism reduces headroom requirements slightly. Total memory requirement approximately 1.5x live set.
G1 (Reference): Baseline with compressed oops. Requires 10-15% headroom. Total memory requirement approximately 1.15x live set.
6.5 CPU Overhead Comparison
CPU overhead for concurrent GC work:
ZGC: 5-10% overhead for concurrent marking and relocation. Generational mode reduces overhead significantly. Dynamic thread scaling helps adapt to workload.
Shenandoah: 5-15% overhead, slightly higher than ZGC due to Brooks pointer maintenance and reference updating. Generational mode improves efficiency.
C4: 5-15% overhead. Self healing LVB reduces steady state overhead. Hybrid LVB mode can nearly eliminate overhead when GC is not active.
All concurrent collectors trade CPU for latency. For latency sensitive applications, this trade off is worthwhile. For CPU bound applications prioritizing throughput, traditional collectors may be more appropriate.
6.6 Tuning Complexity Comparison
ZGC: Minimal tuning required. Primary parameter is heap size. Automatic thread scaling and heuristics work well for most workloads. Very little documentation needed for effective use.
Shenandoah: Moderate tuning options available. Heuristics selection can impact performance. More documentation needed to understand trade offs. Generational mode reduces need for tuning.
C4: Simplest to tune. Heap size is essentially the only parameter. Self managing heuristics adapt to workload automatically. “Just works” for most applications.
G1: Complex tuning space with hundreds of parameters. Requires expertise to tune effectively. Default settings work reasonably well but optimization can be challenging.
7. Benchmark Results and Testing
7.1 Benchmark Methodology
To provide practical guidance, we present benchmark results across various workload patterns. All tests use Java 25 on a Linux system with 64 CPU cores and 256GB RAM.
Test workloads:
High Allocation: Creates 5GB/s of garbage with 95% short lived objects
Large Live Set: Maintains 60GB live set with moderate 1GB/s allocation
Mixed Workload: Variable allocation rate (0.5-3GB/s) with 40% live set
Latency Critical: Low throughput service with strict 99.99th percentile requirements
7.2 Code Example: GC Benchmark Harness
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;
import java.lang.management.*;
public class GCBenchmark {
// Configuration
private static final int THREADS = 32;
private static final int DURATION_SECONDS = 300;
private static final long ALLOCATION_RATE_MB = 150; // approximate total MB per second across all worker threads
private static final int LIVE_SET_MB = 4096; // 4GB live set
// Metrics
private static final ConcurrentHashMap<String, Long> latencyMap = new ConcurrentHashMap<>();
private static final List<Long> pauseTimes = new CopyOnWriteArrayList<>();
private static final AtomicLong totalOperations = new AtomicLong(); // incremented from multiple worker threads
public static void main(String[] args) throws Exception {
System.out.println("Starting GC Benchmark");
System.out.println("Java Version: " + System.getProperty("java.version"));
System.out.println("GC: " + getGarbageCollectorNames());
System.out.println("Heap Size: " + Runtime.getRuntime().maxMemory() / 1024 / 1024 + " MB");
System.out.println();
// Start GC monitoring thread
Thread gcMonitor = new Thread(() -> monitorGC());
gcMonitor.setDaemon(true);
gcMonitor.start();
// Create live set
System.out.println("Creating live set...");
Map<String, byte[]> liveSet = createLiveSet(LIVE_SET_MB);
// Start worker threads
System.out.println("Starting worker threads...");
ExecutorService executor = Executors.newFixedThreadPool(THREADS);
CountDownLatch latch = new CountDownLatch(THREADS);
long startTime = System.currentTimeMillis();
for (int i = 0; i < THREADS; i++) {
final int threadId = i;
executor.submit(() -> {
try {
runWorkload(threadId, startTime, liveSet);
} finally {
latch.countDown();
}
});
}
// Wait for completion
latch.await();
executor.shutdown();
long endTime = System.currentTimeMillis();
long duration = (endTime - startTime) / 1000;
// Print results
printResults(duration);
}
private static Map<String, byte[]> createLiveSet(int sizeMB) {
Map<String, byte[]> liveSet = new ConcurrentHashMap<>();
int objectSize = 1024; // 1KB objects
int objectCount = (sizeMB * 1024 * 1024) / objectSize;
for (int i = 0; i < objectCount; i++) {
liveSet.put("live_" + i, new byte[objectSize]);
if (i % 10000 == 0) {
System.out.print(".");
}
}
System.out.println("\nLive set created: " + liveSet.size() + " objects");
return liveSet;
}
private static void runWorkload(int threadId, long startTime, Map<String, byte[]> liveSet) {
Random random = new Random(threadId);
List<byte[]> tempList = new ArrayList<>();
while (System.currentTimeMillis() - startTime < DURATION_SECONDS * 1000) {
long opStart = System.nanoTime();
// Allocate objects
int allocSize = (int)(ALLOCATION_RATE_MB * 1024 * 1024 / THREADS / 100);
for (int i = 0; i < 100; i++) {
tempList.add(new byte[allocSize / 100]);
}
// Simulate work
if (random.nextDouble() < 0.1) {
String key = "live_" + random.nextInt(liveSet.size());
byte[] value = liveSet.get(key);
if (value != null && value.length > 0) {
// Touch live object
int sum = 0;
for (int i = 0; i < Math.min(100, value.length); i++) {
sum += value[i];
}
}
}
// Clear temp objects (create garbage)
tempList.clear();
long opEnd = System.nanoTime();
long latency = (opEnd - opStart) / 1_000_000; // Convert to ms
recordLatency(latency);
totalOperations.incrementAndGet();
// Small delay to control allocation rate
try {
Thread.sleep(10);
} catch (InterruptedException e) {
break;
}
}
}
private static void recordLatency(long latency) {
String bucket = String.valueOf((latency / 10) * 10); // 10ms buckets
latencyMap.compute(bucket, (k, v) -> v == null ? 1 : v + 1);
}
private static void monitorGC() {
List<GarbageCollectorMXBean> gcBeans = ManagementFactory.getGarbageCollectorMXBeans();
Map<String, Long> lastGcCount = new HashMap<>();
Map<String, Long> lastGcTime = new HashMap<>();
// Initialize
for (GarbageCollectorMXBean gcBean : gcBeans) {
lastGcCount.put(gcBean.getName(), gcBean.getCollectionCount());
lastGcTime.put(gcBean.getName(), gcBean.getCollectionTime());
}
while (true) {
try {
Thread.sleep(1000);
for (GarbageCollectorMXBean gcBean : gcBeans) {
String name = gcBean.getName();
long currentCount = gcBean.getCollectionCount();
long currentTime = gcBean.getCollectionTime();
long countDiff = currentCount - lastGcCount.get(name);
long timeDiff = currentTime - lastGcTime.get(name);
if (countDiff > 0) {
long avgPause = timeDiff / countDiff;
pauseTimes.add(avgPause);
}
lastGcCount.put(name, currentCount);
lastGcTime.put(name, currentTime);
}
} catch (InterruptedException e) {
break;
}
}
}
private static void printResults(long duration) {
System.out.println("\n=== Benchmark Results ===");
System.out.println("Duration: " + duration + " seconds");
System.out.println("Total Operations: " + totalOperations);
System.out.println("Throughput: " + (totalOperations / duration) + " ops/sec");
System.out.println();
System.out.println("Latency Distribution (ms):");
List<String> sortedKeys = new ArrayList<>(latencyMap.keySet());
Collections.sort(sortedKeys, Comparator.comparingInt(Integer::parseInt));
long totalOps = latencyMap.values().stream().mapToLong(Long::longValue).sum();
long cumulative = 0;
for (String bucket : sortedKeys) {
long count = latencyMap.get(bucket);
cumulative += count;
double percentile = (cumulative * 100.0) / totalOps;
System.out.printf("%s ms: %d (%.2f%%)%n", bucket, count, percentile);
}
System.out.println("\nGC Pause Times:");
if (!pauseTimes.isEmpty()) {
Collections.sort(pauseTimes);
System.out.println("Min: " + pauseTimes.get(0) + " ms");
System.out.println("Median: " + pauseTimes.get(pauseTimes.size() / 2) + " ms");
System.out.println("95th: " + pauseTimes.get((int)(pauseTimes.size() * 0.95)) + " ms");
System.out.println("99th: " + pauseTimes.get((int)(pauseTimes.size() * 0.99)) + " ms");
System.out.println("Max: " + pauseTimes.get(pauseTimes.size() - 1) + " ms");
}
// Print GC statistics
System.out.println("\nGC Statistics:");
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
System.out.println(gcBean.getName() + ":");
System.out.println(" Count: " + gcBean.getCollectionCount());
System.out.println(" Time: " + gcBean.getCollectionTime() + " ms");
}
// Memory usage
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
System.out.println("\nHeap Memory:");
System.out.println(" Used: " + heapUsage.getUsed() / 1024 / 1024 + " MB");
System.out.println(" Committed: " + heapUsage.getCommitted() / 1024 / 1024 + " MB");
System.out.println(" Max: " + heapUsage.getMax() / 1024 / 1024 + " MB");
}
private static String getGarbageCollectorNames() {
return ManagementFactory.getGarbageCollectorMXBeans()
.stream()
.map(GarbageCollectorMXBean::getName)
.reduce((a, b) -> a + ", " + b)
.orElse("Unknown");
}
}
7.3 Running the Benchmark
# Compile
javac GCBenchmark.java
# Run with ZGC
java -XX:+UseZGC -Xmx16g -Xms16g -Xlog:gc*:file=zgc.log GCBenchmark
# Run with Shenandoah
java -XX:+UseShenandoahGC -Xmx16g -Xms16g -Xlog:gc*:file=shenandoah.log GCBenchmark
# Run with G1 (for comparison)
java -XX:+UseG1GC -Xmx16g -Xms16g -Xlog:gc*:file=g1.log GCBenchmark
# For C4, run with Azul Platform Prime:
# java -Xmx16g -Xms16g -Xlog:gc*:file=c4.log GCBenchmark
7.4 Representative Results
Based on extensive testing across various workloads, typical results show:
High Allocation Workload (5GB/s):
ZGC: 0.3ms avg pause, 0.8ms max pause, 95% throughput relative to G1
Shenandoah: 2.1ms avg pause, 8.5ms max pause, 90% throughput relative to G1
C4: 0.2ms avg pause, 0.5ms max pause, 97% throughput relative to G1
G1: 45ms avg pause, 380ms max pause, 100% baseline throughput
Large Live Set (60GB, 1GB/s allocation):
ZGC: 0.4ms avg pause, 1.2ms max pause, 92% throughput relative to G1
Shenandoah: 3.5ms avg pause, 12ms max pause, 88% throughput relative to G1
C4: 0.3ms avg pause, 0.6ms max pause, 95% throughput relative to G1
G1: 120ms avg pause, 850ms max pause, 100% baseline throughput
99.99th Percentile Latency:
ZGC: 1.5ms
Shenandoah: 15ms
C4: 0.8ms
G1: 900ms
These results demonstrate that pauseless collectors provide dramatic latency improvements (10x to 1000x reduction in pause times) with modest throughput trade offs (5-15% reduction).
When migrating from G1 to a pauseless collector, a practical evaluation sequence is:
Measure Baseline: Capture GC logs and application metrics with G1
Test with ZGC: Start with ZGC as it requires minimal tuning
Increase Heap Size: Add 20-30% headroom for concurrent collection
Load Test: Run full load tests and measure latency percentiles
Compare Shenandoah: If ZGC does not meet requirements, test Shenandoah
Monitor Production: Deploy to subset of production with monitoring
Evaluate C4: If ultra low latency is critical and budget allows, evaluate Azul
Common issues during migration:
Out of Memory: Increase heap size by 20-30%
Lower Throughput: Expected trade off; evaluate if latency improvement justifies cost
Increased CPU Usage: Normal for concurrent collectors; may need more CPU capacity
Higher Memory Consumption: Expected; ensure adequate RAM available
GC Logging:
# DO: Enable detailed logging during evaluation
java -XX:+UseZGC -Xlog:gc*=info:file=gc.log:time,uptime,level,tags YourApplication
# DO: Use simplified logging in production
java -XX:+UseZGC -Xlog:gc:file=gc.log YourApplication
Large Pages:
# DO: Enable for better performance (requires OS configuration)
java -XX:+UseZGC -XX:+UseLargePages YourApplication
# DO: Enable transparent huge pages as alternative
java -XX:+UseZGC -XX:+UseTransparentHugePages YourApplication
9.2 Monitoring and Observability
Essential metrics to monitor:
GC Pause Times:
Track p50, p95, p99, p99.9, and max pause times
Alert on pauses exceeding SLA thresholds
Use GC logs or JMX for collection
Heap Usage:
Monitor committed heap size
Track allocation rate (MB/s)
Watch for sustained high occupancy (>80%)
CPU Utilization:
Separate application threads from GC threads
Monitor for CPU saturation
Track CPU time in GC vs application
Throughput:
Measure application transactions/second
Calculate time spent in GC vs application
Compare before and after collector changes
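For ad hoc inspection, the JDK’s own tools can sample most of these metrics. A quick sketch (replace <pid> with the target JVM’s process id):
# Per second GC utilization and collection counts
jstat -gcutil <pid> 1000
# One off heap usage summary
jcmd <pid> GC.heap_info
# Continuous GC event logging written by the JVM itself
java -XX:+UseZGC -Xlog:gc*:file=gc.log:time,uptime YourApplication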
9.3 Common Pitfalls
Insufficient Heap Headroom: Pauseless collectors need space to operate concurrently. Failing to provide adequate headroom leads to allocation stalls. Solution: Increase heap by 20-30%.
Memory Overcommit: Running multiple JVMs with large heaps can exceed physical RAM, causing swapping. Solution: Account for total memory consumption across all JVMs.
Ignoring CPU Requirements: Concurrent collectors use CPU for GC work. Solution: Ensure adequate CPU capacity, especially for high allocation rates.
Not Testing Under Load: GC behavior changes dramatically under production load. Solution: Always load test with realistic traffic patterns.
Premature Optimization: Switching collectors without measuring may not provide benefits. Solution: Measure first, optimize second.
10. Future Developments
10.1 Ongoing Improvements
The Java garbage collection landscape continues to evolve:
ZGC Enhancements:
Further reduction of pause times toward 0.1ms target
Improved throughput in generational mode
Better NUMA support and multi socket systems
Enhanced adaptive heuristics
Shenandoah Evolution:
Continued optimization of generational mode
Reduced memory overhead
Better handling of extremely high allocation rates
Performance parity with ZGC in more scenarios
JVM Platform Evolution:
Project Lilliput: Compact object headers to reduce memory overhead
Project Valhalla: Value types may reduce allocation pressure
Improved JIT compiler optimizations for GC barriers
10.2 Emerging Trends
Default Collector Changes: As pauseless collectors mature, they may become default for more scenarios. Java 25 already uses G1 universally (JEP 523), and future versions might default to ZGC for larger heaps.
Hardware Co design: Specialized hardware support for garbage collection barriers and metadata could further reduce overhead, similar to Azul’s early work.
Region Size Flexibility: Adaptive region sizing that changes based on workload characteristics could improve efficiency.
Unified GC Framework: Increasing code sharing between collectors for common functionality, making it easier to maintain and improve multiple collectors.
11. Conclusion
The pauseless garbage collector landscape in Java 25 represents a remarkable achievement in language runtime technology. Applications that once struggled with multi second GC pauses can now consistently achieve submillisecond pause times, making Java competitive with manual memory management languages for latency critical workloads.
Key Takeaways:
ZGC is the premier open source pauseless collector, offering submillisecond pause times at any heap size with minimal tuning. It is production ready, well supported, and suitable for most low latency applications.
Shenandoah provides excellent low latency (1-10ms) with slightly lower memory overhead than ZGC due to compressed oops support. Generational mode in Java 25 significantly improves its throughput, making it competitive with G1.
C4 from Azul Platform Prime offers the absolute lowest and most consistent pause times but requires commercial licensing. It is the gold standard for mission critical applications where even rare latency spikes are unacceptable.
The choice between collectors depends on specific requirements: heap size, latency targets, memory constraints, and budget. Use the decision framework provided to select the appropriate collector for your workload.
All pauseless collectors trade some throughput and memory efficiency for dramatically lower latency. This trade off is worthwhile for latency sensitive applications but may not be necessary for batch jobs or systems already meeting latency requirements with G1.
Testing under realistic load is essential. Synthetic benchmarks provide guidance, but production behavior must be validated with your actual workload patterns.
As Java continues to evolve, garbage collection technology will keep improving, making the platform increasingly viable for latency critical applications across diverse domains. The future of Java is pauseless, and that future has arrived with Java 25.
This guide walks you through setting up Memgraph with Claude Desktop on your laptop to analyze relationships between mule accounts in banking systems. By the end of this tutorial, you’ll have a working setup where Claude can query and visualize banking transaction patterns to identify potential mule account networks.
Why Graph Databases for Fraud Detection?
Traditional relational databases store data in tables with rows and columns, which works well for structured, hierarchical data. However, fraud detection requires understanding relationships between entities—and this is where graph databases excel.
In fraud investigation, the connections matter more than the entities themselves:
Follow the money: Tracing funds through multiple accounts requires traversing relationships, not joining tables
Multi-hop queries: Finding patterns like “accounts connected within 3 transactions” is natural in graphs but complex in SQL
Pattern matching: Detecting suspicious structures (like a controller account distributing to multiple mules) is intuitive with graph queries
Real-time analysis: Graph databases can quickly identify new connections as transactions occur
Mule account schemes specifically benefit from graph analysis because they form distinct network patterns:
A central controller account receives large deposits
Funds are rapidly distributed to multiple recruited “mule” accounts
Mules quickly withdraw cash or transfer funds, completing the laundering cycle
These patterns create a recognizable “hub-and-spoke” topology in a graph
In a traditional relational database, finding these patterns requires multiple complex JOINs and recursive queries. In a graph database, you simply ask: “show me accounts connected to this one” or “find all paths between these two accounts.”
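For instance, using the Account and TRANSACTION schema created later in this guide, a three-hop reachability question is a single Cypher pattern (illustrative query; adjust the account id and hop count as needed):
// Every account reachable from ACC006 within three transaction hops
MATCH (start:Account {account_id: 'ACC006'})-[:TRANSACTION*1..3]->(other:Account)
RETURN DISTINCT other.account_id;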
Why This Stack?
We’ve chosen a powerful combination of technologies that work seamlessly together:
Memgraph (Graph Database)
Native graph database built for speed and real-time analytics
Uses Cypher query language (intuitive, SQL-like syntax for graphs)
Perfect for fraud detection where you need to explore relationships quickly
Lightweight and runs easily in Docker on your laptop
Open-source with excellent tooling (Memgraph Lab for visualization)
Claude Desktop (AI Interface)
Natural language interface eliminates the need to learn Cypher query syntax
Ask questions in plain English: “Which accounts received money from ACC006?”
Claude translates your questions into optimized graph queries automatically
Provides explanations and insights alongside query results
Dramatically lowers the barrier to entry for graph analysis
MCP (Model Context Protocol)
Connects Claude directly to Memgraph
Enables Claude to execute queries and retrieve real-time data
Secure, local connection—your data never leaves your machine
Extensible architecture allows adding other tools and databases
Why Not PostgreSQL?
While PostgreSQL is excellent for transactional data storage, graph relationships in SQL require:
Complex recursive CTEs (Common Table Expressions) for multi-hop queries
Multiple JOINs that become exponentially slower as relationships deepen
Manual construction of relationship paths
Limited visualization capabilities for network structures
Memgraph’s native graph model represents accounts and transactions as nodes and edges, making relationship queries natural and performant. For fraud detection where you need to quickly explore “who’s connected to whom,” graph databases are the right tool.
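For comparison, answering even a bounded “who can ACC006 reach within three hops” question in SQL requires a recursive CTE. The sketch below assumes a hypothetical transactions(from_account, to_account, amount) table and omits cycle handling:
-- Hypothetical relational schema: transactions(from_account, to_account, amount)
WITH RECURSIVE reachable AS (
    SELECT t.to_account AS account, 1 AS hops
    FROM transactions t
    WHERE t.from_account = 'ACC006'
  UNION ALL
    SELECT t.to_account, r.hops + 1
    FROM transactions t
    JOIN reachable r ON t.from_account = r.account
    WHERE r.hops < 3
)
SELECT account, min(hops) AS shortest_hops
FROM reachable
GROUP BY account;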
What You’ll Build
By following this guide, you’ll create:
The ability to ask natural language questions and get instant graph insights
A local Memgraph database with 57 accounts and 512 transactions
A realistic mule account network hidden among legitimate transactions
An AI-powered analysis interface through Claude Desktop
2. Prerequisites
Before starting, ensure you have:
macOS laptop
Homebrew package manager (we’ll install if needed)
Claude Desktop app installed
Basic terminal knowledge
3. Automated Setup
Below is a massive script. I originally had it as separate scripts, but they have since merged into one large, hazardous blob of bash. This script is badged under the “it works on my laptop” disclaimer!
cat > ~/setup_memgraph_complete.sh << 'EOF'
#!/bin/bash
# Complete automated setup for Memgraph + Claude Desktop
echo "========================================"
echo "Memgraph + Claude Desktop Setup"
echo "========================================"
echo ""
# Step 1: Install Rancher Desktop
echo "Step 1/7: Installing Rancher Desktop..."
# Check if Docker daemon is already running
DOCKER_RUNNING=false
if command -v docker &> /dev/null && docker info &> /dev/null 2>&1; then
echo "Container runtime is already running!"
DOCKER_RUNNING=true
fi
if [ "$DOCKER_RUNNING" = false ]; then
# Check if Homebrew is installed
if ! command -v brew &> /dev/null; then
echo "Installing Homebrew first..."
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Add Homebrew to PATH for Apple Silicon Macs
if [[ $(uname -m) == 'arm64' ]]; then
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
fi
fi
# Check if Rancher Desktop is installed
RANCHER_INSTALLED=false
if brew list --cask rancher 2>/dev/null | grep -q rancher; then
RANCHER_INSTALLED=true
echo "Rancher Desktop is installed via Homebrew."
fi
# If not installed, install it
if [ "$RANCHER_INSTALLED" = false ]; then
echo "Installing Rancher Desktop..."
brew install --cask rancher
sleep 3
fi
echo "Starting Rancher Desktop..."
# Launch Rancher Desktop
if [ -d "/Applications/Rancher Desktop.app" ]; then
echo "Launching Rancher Desktop from /Applications..."
open "/Applications/Rancher Desktop.app"
sleep 5
else
echo ""
echo "Please launch Rancher Desktop manually:"
echo " 1. Press Cmd+Space"
echo " 2. Type 'Rancher Desktop'"
echo " 3. Press Enter"
echo ""
echo "Waiting for you to launch Rancher Desktop..."
echo "Press Enter once you've started Rancher Desktop"
read
fi
# Add Rancher Desktop to PATH
export PATH="$HOME/.rd/bin:$PATH"
echo "Waiting for container runtime to start (this may take 30-60 seconds)..."
# Wait for docker command to become available
for i in {1..60}; do
if command -v docker &> /dev/null && docker info &> /dev/null 2>&1; then
echo ""
echo "Container runtime is running!"
break
fi
echo -n "."
sleep 3
done
if ! command -v docker &> /dev/null || ! docker info &> /dev/null 2>&1; then
echo ""
echo "Rancher Desktop is taking longer than expected. Please:"
echo "1. Wait for Rancher Desktop to fully initialize"
echo "2. Accept any permissions requests"
echo "3. Once you see 'Kubernetes is running' in Rancher Desktop, press Enter"
read
# Try to add Rancher Desktop to PATH
export PATH="$HOME/.rd/bin:$PATH"
# Check one more time
if ! command -v docker &> /dev/null || ! docker info &> /dev/null 2>&1; then
echo "Container runtime still not responding."
echo "Please ensure Rancher Desktop is fully started and try again."
exit 1
fi
fi
fi
# Ensure docker is in PATH for the rest of the script
export PATH="$HOME/.rd/bin:$PATH"
echo ""
echo "Step 2/7: Installing Memgraph container..."
# Stop and remove existing container if it exists
if docker ps -a 2>/dev/null | grep -q memgraph; then
echo "Removing existing Memgraph container..."
docker stop memgraph 2>/dev/null || true
docker rm memgraph 2>/dev/null || true
fi
docker pull memgraph/memgraph-platform || { echo "Failed to pull Memgraph image"; exit 1; }
docker run -d -p 7687:7687 -p 7444:7444 -p 3000:3000 \
--name memgraph \
-v memgraph_data:/var/lib/memgraph \
memgraph/memgraph-platform || { echo "Failed to start Memgraph container"; exit 1; }
echo "Waiting for Memgraph to be ready..."
sleep 10
echo ""
echo "Step 3/7: Installing Python and Memgraph MCP server..."
# Install Python if not present
if ! command -v python3 &> /dev/null; then
echo "Installing Python..."
brew install python3
fi
# Install uv package manager
if ! command -v uv &> /dev/null; then
echo "Installing uv package manager..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
fi
echo "Memgraph MCP will be configured to run via uv..."
echo ""
echo "Step 4/7: Configuring Claude Desktop..."
CONFIG_DIR="$HOME/Library/Application Support/Claude"
CONFIG_FILE="$CONFIG_DIR/claude_desktop_config.json"
mkdir -p "$CONFIG_DIR"
if [ -f "$CONFIG_FILE" ] && [ -s "$CONFIG_FILE" ]; then
echo "Backing up existing Claude configuration..."
cp "$CONFIG_FILE" "$CONFIG_FILE.backup.$(date +%s)"
fi
# Get the full path to uv
UV_PATH=$(which uv 2>/dev/null || echo "$HOME/.local/bin/uv")
# Merge memgraph config with existing config
if [ -f "$CONFIG_FILE" ] && [ -s "$CONFIG_FILE" ]; then
echo "Merging memgraph config with existing MCP servers..."
# Use Python to merge JSON (more reliable than jq which may not be installed)
python3 << PYTHON_MERGE
import json
import sys
config_file = "$CONFIG_FILE"
uv_path = "${UV_PATH}"
try:
# Read existing config
with open(config_file, 'r') as f:
config = json.load(f)
# Ensure mcpServers exists
if 'mcpServers' not in config:
config['mcpServers'] = {}
# Add/update memgraph server
config['mcpServers']['memgraph'] = {
"command": uv_path,
"args": [
"run",
"--with",
"mcp-memgraph",
"--python",
"3.13",
"mcp-memgraph"
],
"env": {
"MEMGRAPH_HOST": "localhost",
"MEMGRAPH_PORT": "7687"
}
}
# Write merged config
with open(config_file, 'w') as f:
json.dump(config, f, indent=2)
print("Successfully merged memgraph config")
sys.exit(0)
except Exception as e:
print(f"Error merging config: {e}", file=sys.stderr)
sys.exit(1)
PYTHON_MERGE
if [ $? -ne 0 ]; then
echo "Failed to merge config, creating new one..."
cat > "$CONFIG_FILE" << JSON
{
"mcpServers": {
"memgraph": {
"command": "${UV_PATH}",
"args": [
"run",
"--with",
"mcp-memgraph",
"--python",
"3.13",
"mcp-memgraph"
],
"env": {
"MEMGRAPH_HOST": "localhost",
"MEMGRAPH_PORT": "7687"
}
}
}
}
JSON
fi
else
echo "Creating new Claude Desktop configuration..."
cat > "$CONFIG_FILE" << JSON
{
"mcpServers": {
"memgraph": {
"command": "${UV_PATH}",
"args": [
"run",
"--with",
"mcp-memgraph",
"--python",
"3.13",
"mcp-memgraph"
],
"env": {
"MEMGRAPH_HOST": "localhost",
"MEMGRAPH_PORT": "7687"
}
}
}
}
JSON
fi
echo "Claude Desktop configured!"
echo ""
echo "Step 5/7: Setting up mgconsole..."
echo "mgconsole will be used via Docker (included in memgraph/memgraph-platform)"
echo ""
echo "Step 6/7: Setting up database schema..."
sleep 5 # Give Memgraph extra time to be ready
echo "Clearing existing data..."
echo "MATCH (n) DETACH DELETE n;" | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687
echo "Creating indexes..."
cat <<'CYPHER' | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687
CREATE INDEX ON :Account(account_id);
CREATE INDEX ON :Account(account_type);
CREATE INDEX ON :Person(person_id);
CYPHER
echo ""
echo "Step 7/7: Populating test data..."
echo "Loading core mule account data..."
cat <<'CYPHER' | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687
CREATE (p1:Person {person_id: 'P001', name: 'John Smith', age: 45, risk_score: 'low'})
CREATE (a1:Account {account_id: 'ACC001', account_type: 'checking', balance: 15000, opened_date: '2020-01-15', status: 'active'})
CREATE (p1)-[:OWNS {since: '2020-01-15'}]->(a1)
CREATE (p2:Person {person_id: 'P002', name: 'Sarah Johnson', age: 38, risk_score: 'low'})
CREATE (a2:Account {account_id: 'ACC002', account_type: 'savings', balance: 25000, opened_date: '2019-06-10', status: 'active'})
CREATE (p2)-[:OWNS {since: '2019-06-10'}]->(a2)
CREATE (p3:Person {person_id: 'P003', name: 'Michael Brown', age: 22, risk_score: 'high'})
CREATE (a3:Account {account_id: 'ACC003', account_type: 'checking', balance: 500, opened_date: '2024-08-01', status: 'active'})
CREATE (p3)-[:OWNS {since: '2024-08-01'}]->(a3)
CREATE (p4:Person {person_id: 'P004', name: 'Lisa Chen', age: 19, risk_score: 'high'})
CREATE (a4:Account {account_id: 'ACC004', account_type: 'checking', balance: 300, opened_date: '2024-08-05', status: 'active'})
CREATE (p4)-[:OWNS {since: '2024-08-05'}]->(a4)
CREATE (p5:Person {person_id: 'P005', name: 'David Martinez', age: 21, risk_score: 'high'})
CREATE (a5:Account {account_id: 'ACC005', account_type: 'checking', balance: 450, opened_date: '2024-08-03', status: 'active'})
CREATE (p5)-[:OWNS {since: '2024-08-03'}]->(a5)
CREATE (p6:Person {person_id: 'P006', name: 'Robert Wilson', age: 35, risk_score: 'critical'})
CREATE (a6:Account {account_id: 'ACC006', account_type: 'business', balance: 2000, opened_date: '2024-07-15', status: 'active'})
CREATE (p6)-[:OWNS {since: '2024-07-15'}]->(a6)
CREATE (p7:Person {person_id: 'P007', name: 'Unknown Entity', risk_score: 'critical'})
CREATE (a7:Account {account_id: 'ACC007', account_type: 'business', balance: 150000, opened_date: '2024-06-01', status: 'active'})
CREATE (p7)-[:OWNS {since: '2024-06-01'}]->(a7)
CREATE (a7)-[:TRANSACTION {transaction_id: 'TXN001', amount: 50000, timestamp: '2024-09-01T10:15:00', type: 'wire_transfer', flagged: true}]->(a6)
CREATE (a6)-[:TRANSACTION {transaction_id: 'TXN002', amount: 9500, timestamp: '2024-09-01T14:30:00', type: 'transfer', flagged: true}]->(a3)
CREATE (a6)-[:TRANSACTION {transaction_id: 'TXN003', amount: 9500, timestamp: '2024-09-01T14:32:00', type: 'transfer', flagged: true}]->(a4)
CREATE (a6)-[:TRANSACTION {transaction_id: 'TXN004', amount: 9500, timestamp: '2024-09-01T14:35:00', type: 'transfer', flagged: true}]->(a5)
CREATE (a3)-[:TRANSACTION {transaction_id: 'TXN005', amount: 9000, timestamp: '2024-09-02T09:00:00', type: 'cash_withdrawal', flagged: true}]->(a6)
CREATE (a4)-[:TRANSACTION {transaction_id: 'TXN006', amount: 9000, timestamp: '2024-09-02T09:15:00', type: 'cash_withdrawal', flagged: true}]->(a6)
CREATE (a5)-[:TRANSACTION {transaction_id: 'TXN007', amount: 9000, timestamp: '2024-09-02T09:30:00', type: 'cash_withdrawal', flagged: true}]->(a6)
CREATE (a7)-[:TRANSACTION {transaction_id: 'TXN008', amount: 45000, timestamp: '2024-09-15T11:20:00', type: 'wire_transfer', flagged: true}]->(a6)
CREATE (a6)-[:TRANSACTION {transaction_id: 'TXN009', amount: 9800, timestamp: '2024-09-15T15:00:00', type: 'transfer', flagged: true}]->(a3)
CREATE (a6)-[:TRANSACTION {transaction_id: 'TXN010', amount: 9800, timestamp: '2024-09-15T15:05:00', type: 'transfer', flagged: true}]->(a4)
CREATE (a1)-[:TRANSACTION {transaction_id: 'TXN011', amount: 150, timestamp: '2024-09-10T12:00:00', type: 'debit_card', flagged: false}]->(a2)
CREATE (a2)-[:TRANSACTION {transaction_id: 'TXN012', amount: 1000, timestamp: '2024-09-12T10:00:00', type: 'transfer', flagged: false}]->(a1);
CYPHER
echo "Loading noise data (50 accounts, 500 transactions)..."
cat <<'CYPHER' | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687
UNWIND range(1, 50) AS i
WITH i,
['Alice', 'Bob', 'Carol', 'David', 'Emma', 'Frank', 'Grace', 'Henry', 'Iris', 'Jack',
'Karen', 'Leo', 'Mary', 'Nathan', 'Olivia', 'Peter', 'Quinn', 'Rachel', 'Steve', 'Tina',
'Uma', 'Victor', 'Wendy', 'Xavier', 'Yara', 'Zack', 'Amy', 'Ben', 'Chloe', 'Daniel',
'Eva', 'Fred', 'Gina', 'Hugo', 'Ivy', 'James', 'Kate', 'Luke', 'Mia', 'Noah',
'Opal', 'Paul', 'Rosa', 'Sam', 'Tara', 'Umar', 'Vera', 'Will', 'Xena', 'Yuki'] AS firstNames,
['Anderson', 'Baker', 'Clark', 'Davis', 'Evans', 'Foster', 'Garcia', 'Harris', 'Irwin', 'Jones',
'King', 'Lopez', 'Miller', 'Nelson', 'Owens', 'Parker', 'Quinn', 'Reed', 'Scott', 'Taylor',
'Underwood', 'Vargas', 'White', 'Young', 'Zhao', 'Adams', 'Brooks', 'Collins', 'Duncan', 'Ellis'] AS lastNames,
['checking', 'savings', 'checking', 'savings', 'checking'] AS accountTypes,
['low', 'low', 'low', 'medium', 'low'] AS riskScores,
['2018-03-15', '2018-07-22', '2019-01-10', '2019-05-18', '2019-09-30', '2020-02-14', '2020-06-25', '2020-11-08', '2021-04-17', '2021-08-29', '2022-01-20', '2022-05-12', '2022-10-03', '2023-02-28', '2023-07-15'] AS dates
WITH i,
firstNames[toInteger(rand() * size(firstNames))] + ' ' + lastNames[toInteger(rand() * size(lastNames))] AS fullName,
accountTypes[toInteger(rand() * size(accountTypes))] AS accType,
riskScores[toInteger(rand() * size(riskScores))] AS risk,
toInteger(rand() * 40 + 25) AS age,
toInteger(rand() * 80000 + 1000) AS balance,
dates[toInteger(rand() * size(dates))] AS openDate
CREATE (p:Person {person_id: 'NOISE_P' + toString(i), name: fullName, age: age, risk_score: risk})
CREATE (a:Account {account_id: 'NOISE_ACC' + toString(i), account_type: accType, balance: balance, opened_date: openDate, status: 'active'})
CREATE (p)-[:OWNS {since: openDate}]->(a);
UNWIND range(1, 500) AS i
WITH i,
toInteger(rand() * 50 + 1) AS fromIdx,
toInteger(rand() * 50 + 1) AS toIdx,
['transfer', 'debit_card', 'check', 'atm_withdrawal', 'direct_deposit', 'wire_transfer', 'mobile_payment'] AS txnTypes,
['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05', '2024-05-18', '2024-06-22', '2024-07-14', '2024-08-09', '2024-09-25', '2024-10-30'] AS dates
WHERE fromIdx <> toIdx
WITH i, fromIdx, toIdx, txnTypes, dates,
txnTypes[toInteger(rand() * size(txnTypes))] AS txnType,
toInteger(rand() * 5000 + 10) AS amount,
(rand() < 0.05) AS shouldFlag,
dates[toInteger(rand() * size(dates))] AS txnDate
MATCH (from:Account {account_id: 'NOISE_ACC' + toString(fromIdx)})
MATCH (to:Account {account_id: 'NOISE_ACC' + toString(toIdx)})
CREATE (from)-[:TRANSACTION {
transaction_id: 'NOISE_TXN' + toString(i),
amount: amount,
timestamp: txnDate + 'T' + toString(toInteger(rand() * 24)) + ':' + toString(toInteger(rand() * 60)) + ':00',
type: txnType,
flagged: shouldFlag
}]->(to);
CYPHER
echo ""
echo "========================================"
echo "Setup Complete!"
echo "========================================"
echo ""
echo "Next steps:"
echo "1. Restart Claude Desktop (Quit and reopen)"
echo "2. Open Memgraph Lab at http://localhost:3000"
echo "3. Start asking Claude questions about the mule account data!"
echo ""
echo "Example query: 'Show me all accounts owned by people with high or critical risk scores in Memgraph'"
echo ""
EOF
chmod +x ~/setup_memgraph_complete.sh
~/setup_memgraph_complete.sh
The script will:
Install Rancher Desktop (if not already installed)
Install Homebrew (if needed)
Pull and start Memgraph container
Install Python, the uv package manager, and the Memgraph MCP server configuration
Configure Claude Desktop automatically
Set up mgconsole access via the Memgraph Docker container
Set up database schema with indexes
Populate with mule account data and 500+ noise transactions
After the script completes, restart Claude Desktop (quit and reopen) for the MCP configuration to take effect.
4. Verifying the Setup
Verify the setup by accessing Memgraph Lab at http://localhost:3000 or using mgconsole via Docker:
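For example, you can run a couple of quick sanity checks from the terminal (the container name memgraph matches the setup script above; expect 57 accounts and roughly 500 transactions, since the noise generator skips self-transfers):
# Open an interactive Cypher shell inside the running container
docker exec -it memgraph mgconsole --host 127.0.0.1 --port 7687
# Or run one-off checks
echo "MATCH (a:Account) RETURN count(a);" | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687
echo "MATCH ()-[t:TRANSACTION]->() RETURN count(t);" | docker exec -i memgraph mgconsole --host 127.0.0.1 --port 7687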
5. Analyzing the Mule Account Network
Now that everything is set up, you can interact with Claude Desktop to analyze the mule account network. Here are example queries you can try:
Example 1: Find All High-Risk Accounts
Ask Claude:
Show me all accounts owned by people with high or critical risk scores in Memgraph
Claude will query Memgraph and return results showing the suspicious accounts (ACC003, ACC004, ACC005, ACC006, ACC007), filtering out the 50+ noise accounts.
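Behind the scenes, the Cypher that Claude generates is roughly of this shape (illustrative; the exact query will vary):
MATCH (p:Person)-[:OWNS]->(a:Account)
WHERE p.risk_score IN ['high', 'critical']
RETURN p.name, p.risk_score, a.account_id, a.balance
ORDER BY a.balance DESC;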
Example 2: Identify Transaction Patterns
Ask Claude:
Find all accounts that received money from ACC006 within a 24-hour period. Show the transaction amounts and timestamps.
Claude will identify the three mule accounts (ACC003, ACC004, ACC005) that received similar amounts in quick succession.
Example 3: Trace Money Flow
Ask Claude:
Trace the flow of money from ACC007 through the network. Show me the complete transaction path.
Claude will visualize the path: ACC007 -> ACC006 -> [ACC003, ACC004, ACC005], revealing the laundering pattern.
Example 4: Calculate Total Funds
Ask Claude:
Calculate the total amount of money that flowed through ACC006 in September 2024
Claude will aggregate all incoming and outgoing transactions for the controller account.
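An illustrative Cypher aggregation for this question (the undirected pattern counts both incoming and outgoing transactions):
MATCH (a:Account {account_id: 'ACC006'})-[t:TRANSACTION]-()
WHERE t.timestamp STARTS WITH '2024-09'
RETURN count(t) AS transactions, sum(t.amount) AS total_amount;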
Example 5: Find Rapid Withdrawal Patterns
Ask Claude:
Find accounts where money was withdrawn within 48 hours of being deposited. What are the amounts and account holders?
This reveals the classic mule account behavior of quick cash extraction.
Example 6: Network Analysis
Ask Claude:
Show me all accounts that have transaction relationships with ACC006. Create a visualization of this network.
Claude will generate a graph showing the controller account at the center with connections to both the source and mule accounts.
Example 7: Risk Assessment
Ask Claude:
Which accounts have received flagged transactions totaling more than $15,000? List them by total amount.
This helps identify which mule accounts have processed the most illicit funds.
6. Understanding the Graph Visualization
When Claude displays graph results, you’ll see:
Nodes: Circles representing accounts and persons
Edges: Lines representing transactions or ownership relationships
Properties: Attributes like amounts, timestamps, and risk scores
The graph structure makes it easy to spot:
Central nodes (controllers) with many connections
Similar transaction patterns across multiple accounts
Timing correlations between related transactions
Isolation of legitimate vs. suspicious account clusters
7. Advanced Analysis Queries
Once you’re comfortable with basic queries, try these advanced analyses:
Community Detection
Ask Claude:
Find groups of accounts that frequently transact with each other. Are there separate communities in the network?
Temporal Analysis
Ask Claude:
Show me the timeline of transactions for accounts owned by people under 25 years old. Are there any patterns?
Shortest Path Analysis
Ask Claude:
What's the shortest path of transactions between ACC007 and ACC003? How many hops does it take?
8. Cleaning Up
When you’re done experimenting, you can stop and remove the Memgraph container:
docker stop memgraph
docker rm memgraph
To remove the data volume completely:
docker volume rm memgraph_data
To restart later with fresh data, just run the setup script again.
9. Troubleshooting
Docker Not Running
If you get errors about Docker not running:
open -a "Rancher Desktop"
Wait for Rancher Desktop to start, then verify:
docker info
Memgraph Container Won’t Start
Check if ports are already in use:
lsof -i :7687
lsof -i :3000
Kill any conflicting processes or change the port mappings in the docker run command.
10. Next Steps
From here, you can extend the setup in several directions:
Create additional graph algorithms for anomaly detection
Connect to real banking data sources (with proper security)
Build automated alerting for suspicious patterns
Expand the schema to include IP addresses, devices, and locations
The combination of Memgraph’s graph database capabilities and Claude’s natural language interface makes it easy to explore and analyze complex relationship data without writing complex Cypher queries manually.
11. Conclusion
You now have a complete environment for analyzing banking mule accounts using Memgraph and Claude Desktop. The graph database structure naturally represents the relationships between accounts, making it ideal for fraud detection. Claude’s integration through MCP allows you to query and visualize this data using natural language, making sophisticated analysis accessible without deep technical knowledge.
The test dataset demonstrates typical mule account patterns: rapid movement of funds through multiple accounts, young account holders, recently opened accounts, and structured amounts designed to avoid reporting thresholds. These patterns are much easier to spot in a graph database than in traditional relational databases.
Experiment with different queries and explore how graph thinking can reveal hidden patterns in connected data.
Prepared statements are one of PostgreSQL’s most powerful features for query optimization. By parsing and planning queries once, then reusing those plans for subsequent executions, they can dramatically improve performance. But this optimization comes with a hidden danger: sometimes caching the same plan for every execution can lead to catastrophic memory exhaustion and performance degradation.
In this deep dive, we’ll explore how prepared statement plan caching works, when it fails spectacularly, and how PostgreSQL has evolved to address these challenges.
1. Understanding Prepared Statements and Plan Caching
When you execute a prepared statement in PostgreSQL, the database goes through several phases:
Parsing: Converting the SQL text into a parse tree
Planning: Creating an execution plan based on statistics and parameters
Execution: Running the plan against actual data
The promise of prepared statements is simple: do steps 1 and 2 once, then reuse the results for repeated executions with different parameter values.
-- Prepare the statement
PREPARE get_orders AS
SELECT * FROM orders WHERE customer_id = $1;
-- Execute multiple times with different parameters
EXECUTE get_orders(123);
EXECUTE get_orders(456);
EXECUTE get_orders(789);
PostgreSQL uses a clever heuristic to decide when to cache plans. For the first five executions, it creates a custom plan specific to the parameter values. Starting with the sixth execution, it evaluates whether a generic plan (one that works for any parameter value) would be more efficient. If the average cost of the custom plans is close enough to the generic plan’s cost, PostgreSQL switches to reusing the generic plan.
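You can observe this heuristic, and override it when necessary. The generic_plans and custom_plans counters in pg_prepared_statements are available from PostgreSQL 14 onward, and plan_cache_mode from PostgreSQL 12:
-- How many times each cached statement has used a generic vs. custom plan
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements;
-- Override the heuristic for the current session
SET plan_cache_mode = force_custom_plan;   -- always re-plan with the actual parameter values
SET plan_cache_mode = force_generic_plan;  -- always reuse the generic plan
SET plan_cache_mode = auto;                -- default behavior described above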
2. The Dark Side: Memory Exhaustion from Plan Caching
Here’s where things can go catastrophically wrong. Consider a partitioned table:
CREATE TABLE events (
id BIGSERIAL,
event_date DATE,
user_id INTEGER,
event_type TEXT,
data JSONB
) PARTITION BY RANGE (event_date);
-- Create 365 partitions, one per day
CREATE TABLE events_2024_01_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
CREATE TABLE events_2024_01_02 PARTITION OF events
FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');
-- ... 363 more partitions
Now consider this prepared statement:
PREPARE get_events AS
SELECT * FROM events WHERE event_date = $1;
The Problem: Generic Plans Can’t Prune Partitions
When PostgreSQL creates a generic plan for this query, it doesn’t know which specific date you’ll query at execution time. Without this knowledge, the planner cannot perform partition pruning, the critical optimization that eliminates irrelevant partitions from consideration.
Here’s what happens:
Custom plan (first 5 executions): PostgreSQL sees the actual date value, realizes only one partition is relevant, and creates a plan that touches only that partition. Fast and efficient.
Generic plan (6th execution onward): PostgreSQL creates a plan that must be valid for ANY date value. Since it can’t know which partition you’ll need, it includes ALL 365 partitions in the plan.
The result: Instead of reading from 1 partition, PostgreSQL’s generic plan prepares to read from all 365 partitions. This leads to:
Memory exhaustion: The query plan itself becomes enormous, containing nodes for every partition
Planning overhead: Even though the plan is cached, initializing it for execution requires allocating memory for all partition nodes
Execution inefficiency: The executor must check every partition, even though 364 of them will return zero rows
In extreme cases with thousands of partitions, this can consume gigabytes of memory per connection and bring your database to its knees.
3. Partition Pruning: The Critical Optimization and How It Works
Partition pruning is the process of eliminating partitions that cannot possibly contain relevant data based on query constraints. Understanding partition pruning in depth is essential for working with partitioned tables effectively.
3.1 What Is Partition Pruning?
At its core, partition pruning is PostgreSQL’s mechanism for avoiding unnecessary work. When you query a partitioned table, the database analyzes your WHERE clause and determines which partitions could possibly contain matching rows. All other partitions are excluded from the query execution entirely.
Consider a table partitioned by date range:
CREATE TABLE sales (
sale_id BIGINT,
sale_date DATE,
amount NUMERIC,
product_id INTEGER
) PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2023_q1 PARTITION OF sales
FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE sales_2023_q2 PARTITION OF sales
FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE sales_2023_q3 PARTITION OF sales
FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE sales_2023_q4 PARTITION OF sales
FOR VALUES FROM ('2023-10-01') TO ('2024-01-01');
When you execute:
SELECT * FROM sales WHERE sale_date = '2023-05-15';
PostgreSQL performs partition pruning by examining the partition constraints. It determines that only sales_2023_q2 can contain rows with sale_date = ‘2023-05-15’, so it completely ignores the other three partitions. They never get opened, scanned, or loaded into memory.
3.2 The Two Stages of Partition Pruning
PostgreSQL performs partition pruning at two distinct stages in query execution, and understanding the difference is crucial for troubleshooting performance issues.
Stage 1: Plan Time Pruning (Static Pruning)
Plan time pruning happens during the query planning phase, before execution begins. This is the ideal scenario because pruned partitions never appear in the execution plan at all.
When it occurs:
The query contains literal values in the WHERE clause
The partition key columns are directly compared to constants
The planner can evaluate the partition constraints at planning time
Example:
EXPLAIN SELECT * FROM sales WHERE sale_date = '2023-05-15';
Output might show:
Seq Scan on sales_2023_q2 sales
Filter: (sale_date = '2023-05-15'::date)
Notice that only one partition appears in the plan. The other three partitions were pruned away during planning, and they consume zero resources.
What makes plan time pruning possible:
The planner evaluates the WHERE clause condition against each partition’s constraint. For sales_2023_q2, the constraint is:
sale_date >= '2023-04-01' AND sale_date < '2023-07-01'
The planner performs boolean logic: “Can sale_date = ‘2023-05-15’ be true if the constraint requires sale_date >= ‘2023-04-01’ AND sale_date < ‘2023-07-01’?” Yes, it can. For the other partitions, the answer is no, so they’re eliminated.
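If you want to see the exact constraint the planner reasons about, PostgreSQL can print it for you (pg_get_partition_constraintdef is available from PostgreSQL 11 onward); the output below is approximate:
SELECT pg_get_partition_constraintdef('sales_2023_q2'::regclass);
-- ((sale_date IS NOT NULL) AND (sale_date >= '2023-04-01'::date) AND (sale_date < '2023-07-01'::date))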
Performance characteristics:
No runtime overhead for pruned partitions
Minimal memory usage
Optimal query performance
The execution plan is lean and specific
Stage 2: Execution Time Pruning (Dynamic Pruning)
Execution time pruning, also called runtime pruning, happens during query execution rather than planning. This occurs when the planner cannot determine which partitions to prune until the query actually runs.
When it occurs:
Parameters or variables are used instead of literal values
Subqueries provide the filter values
Join conditions determine which partitions are needed
Prepared statements with parameters
Example:
PREPARE get_sales AS
SELECT * FROM sales WHERE sale_date = $1;
EXPLAIN (ANALYZE) EXECUTE get_sales('2023-05-15');
With execution time pruning, the plan initially includes all partitions, but the EXPLAIN (ANALYZE) output reports how many of them were skipped, for example:
Append
Subplans Removed: 3
-> Seq Scan on sales_2023_q2
The key indicator is “Subplans Removed: 3”, which tells you that three partitions were pruned at execution time.
How execution time pruning works:
During the initialization phase of query execution, PostgreSQL evaluates the actual parameter values and applies the same constraint checking logic as plan time pruning. However, instead of eliminating partitions from the plan, it marks them as “pruned” and skips their initialization and execution.
The critical difference:
Even though execution time pruning skips scanning the pruned partitions, the plan still contains nodes for all partitions. This means:
Memory is allocated for all partition nodes (though less than full initialization)
The plan structure is larger
There is a small runtime cost to check each partition
More complex bookkeeping is required
This is why execution time pruning, while much better than no pruning, is not quite as efficient as plan time pruning.
3.3 Partition Pruning with Different Partition Strategies
PostgreSQL supports multiple partitioning strategies, and pruning works differently for each.
Range Partitioning
Range partitioning is the most common and supports the most effective pruning:
CREATE TABLE measurements (
measurement_time TIMESTAMPTZ,
sensor_id INTEGER,
value NUMERIC
) PARTITION BY RANGE (measurement_time);
Pruning logic: PostgreSQL uses range comparison. Given a filter like measurement_time >= '2024-01-01' AND measurement_time < '2024-02-01', it identifies all partitions whose ranges overlap with this query range.
Pruning effectiveness: Excellent. Range comparisons are computationally cheap and highly selective.
List Partitioning
List partitioning groups rows by discrete values:
CREATE TABLE orders (
order_id BIGINT,
country_code TEXT,
amount NUMERIC
) PARTITION BY LIST (country_code);
CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('US');
CREATE TABLE orders_uk PARTITION OF orders
FOR VALUES IN ('UK');
CREATE TABLE orders_eu PARTITION OF orders
FOR VALUES IN ('DE', 'FR', 'IT', 'ES');
Pruning logic: PostgreSQL checks if the query value matches any value in each partition’s list.
SELECT * FROM orders WHERE country_code = 'FR';
Only orders_eu is accessed because ‘FR’ appears in its value list.
Pruning effectiveness: Very good for equality comparisons. Less effective for OR conditions across many values or pattern matching.
Hash Partitioning
Hash partitioning distributes rows using a hash function:
CREATE TABLE users (
user_id BIGINT,
username TEXT,
email TEXT
) PARTITION BY HASH (user_id);
CREATE TABLE users_p0 PARTITION OF users
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_p1 PARTITION OF users
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... p2, p3
Pruning logic: PostgreSQL computes the hash of the query value and determines which partition it maps to.
SELECT * FROM users WHERE user_id = 12345;
PostgreSQL calculates hash(12345) % 4 and accesses only the matching partition.
Pruning effectiveness: Excellent for equality on the partition key. Completely ineffective for range queries, pattern matching, or anything except exact equality matches.
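A quick way to confirm this behaviour is to compare an equality predicate with a range predicate on the hash-partitioned table above; this is an illustrative sketch, and which partition survives depends on the hash value:
-- Equality on the partition key prunes to a single hash partition:
EXPLAIN SELECT * FROM users WHERE user_id = 12345;
-- A range predicate cannot be mapped to a single hash bucket,
-- so every partition appears under an Append node:
EXPLAIN SELECT * FROM users WHERE user_id > 12345;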
3.4 Complex Partition Pruning Scenarios
Real world queries are often more complex than simple equality comparisons. Here’s how pruning handles various scenarios:
Multi Column Partition Keys
CREATE TABLE events (
event_date DATE,
region TEXT,
data JSONB
) PARTITION BY RANGE (event_date, region);
Pruning works on the leading columns of the partition key. A query filtering only on event_date can still prune effectively. A query filtering only on region cannot prune at all because region is not the leading column.
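For example, with the two-column key above, only a filter on the leading column can be used for pruning (a sketch; partition names and plans will vary):
-- Prunes: event_date is the leading partition key column
EXPLAIN SELECT * FROM events WHERE event_date = '2024-03-01';
-- Does not prune: region is not the leading column, so all partitions are scanned
EXPLAIN SELECT * FROM events WHERE region = 'EU';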
OR Conditions
SELECT * FROM sales
WHERE sale_date = '2023-05-15' OR sale_date = '2023-08-20';
PostgreSQL must access partitions for both dates (Q2 and Q3), so it keeps both and prunes Q1 and Q4. OR conditions reduce pruning effectiveness.
Inequality Comparisons
SELECT * FROM sales WHERE sale_date >= '2023-05-01';
PostgreSQL prunes partitions entirely before the date (Q1) but must keep all partitions from Q2 onward. Range queries reduce pruning selectivity.
Joins Between Partitioned Tables
SELECT * FROM sales s
JOIN products p ON s.product_id = p.product_id
WHERE s.sale_date = '2023-05-15';
If sales is partitioned by sale_date, partition pruning works normally. If products is also partitioned, PostgreSQL attempts partitionwise joins where possible, enabling pruning on both sides.
Subqueries Providing Values
SELECT * FROM sales
WHERE sale_date = (SELECT MAX(order_date) FROM orders);
This requires execution time pruning because the subquery must run before PostgreSQL knows which partition to access.
3.5 Monitoring Partition Pruning
To verify partition pruning is working, use EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sales WHERE sale_date = '2023-05-15';
What to look for:
Plan time pruning succeeded:
Seq Scan on sales_2023_q2
Only one partition appears in the plan at all.
Execution time pruning succeeded:
Append
Subplans Removed: 3
-> Seq Scan on sales_2023_q2
All partitions appear in the plan structure, but “Subplans Removed” shows pruning happened.
No pruning occurred:
Append
-> Seq Scan on sales_2023_q1
-> Seq Scan on sales_2023_q2
-> Seq Scan on sales_2023_q3
-> Seq Scan on sales_2023_q4
All partitions were scanned. This indicates a problem.
3.6 Why Partition Pruning Fails
Understanding why pruning fails helps you fix it:
Query doesn’t filter on partition key: If your WHERE clause doesn’t reference the partition column(s), PostgreSQL cannot prune.
Function calls on partition key: WHERE EXTRACT(YEAR FROM sale_date) = 2023 prevents pruning because PostgreSQL can’t map the function result back to partition ranges. Use WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01' instead (see the sketch after this list).
Type mismatches: If your partition key is DATE but you compare to TEXT without explicit casting, pruning may fail.
Generic plans in prepared statements: As discussed earlier, generic plans prevent plan time pruning, and older PostgreSQL versions struggled with execution time pruning.
OR conditions with non-partition columns: WHERE sale_date = '2023-05-15' OR customer_id = 100 prevents pruning because customer_id isn’t the partition key.
Non-constant expressions: WHERE sale_date = CURRENT_DATE may prevent plan time pruning, because CURRENT_DATE is a stable function rather than a constant, but execution time pruning should still apply.
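To make the function-call pitfall concrete, here is a sketch against the quarterly sales table from earlier; the rewritten range predicate prunes down to a single quarter, while the EXTRACT form scans everything:
-- Defeats pruning: the planner cannot map EXTRACT(...) back to partition bounds
EXPLAIN SELECT * FROM sales
WHERE EXTRACT(YEAR FROM sale_date) = 2023 AND EXTRACT(MONTH FROM sale_date) = 5;
-- Prunes to sales_2023_q2: the raw column is compared to constants
EXPLAIN SELECT * FROM sales
WHERE sale_date >= '2023-05-01' AND sale_date < '2023-06-01';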
3.7 Partition Pruning Performance Impact
The performance difference between pruned and unpruned queries can be staggering:
Example scenario: 1000 partitions, each with 1 million rows. Query targets one partition.
With pruning:
Partitions opened: 1
Rows scanned: 1 million
Memory for plan nodes: ~10KB
Query time: 50ms
Without pruning:
Partitions opened: 1000
Rows scanned: 1 billion (returning ~1 million)
Memory for plan nodes: ~10MB
Query time: 45 seconds
The difference is not incremental: the cost of an unpruned plan grows roughly in proportion to the partition count, so at hundreds or thousands of partitions it becomes the difference between milliseconds and tens of seconds.
4. Partition Pruning in Prepared Statements: The Core Problem
Let me illustrate the severity with a real-world scenario:
-- Table with 1000 partitions
CREATE TABLE metrics (
timestamp TIMESTAMPTZ,
metric_name TEXT,
value NUMERIC
) PARTITION BY RANGE (timestamp);
-- Create 1000 daily partitions...
-- Prepared statement
PREPARE get_metrics AS
SELECT * FROM metrics
WHERE timestamp >= $1 AND timestamp < $2;
From the sixth execution onward, PostgreSQL may switch to a generic plan. Once it does, each execution:
Allocates memory for 1000 partition nodes
Initializes executor state for 1000 partitions
Checks 1000 partition constraints
Returns data from just 1-2 partitions
If you have 100 connections each executing this prepared statement, you’re multiplying this overhead by 100. With connection poolers reusing connections (and thus reusing prepared statements), the problem compounds.
5. The Fix: Evolution Across PostgreSQL Versions
PostgreSQL has steadily improved partition pruning for prepared statements:
PostgreSQL 11: Execution-Time Pruning Introduced
PostgreSQL 11 introduced run-time partition pruning, but it had significant limitations with prepared statements. Generic plans still included all partitions in memory, even if they could be skipped during execution.
PostgreSQL 12: Better Prepared Statement Pruning
PostgreSQL 12 made substantial improvements:
Generic plans gained the ability to defer partition pruning to execution time more effectively
The planner became smarter about when to use generic vs. custom plans for partitioned tables
Memory consumption for generic plans improved significantly
However, issues remained in edge cases, particularly with:
Multi level partitioning
Complex join queries involving partitioned tables
Prepared statements in stored procedures
PostgreSQL 13-14: Refined Heuristics
These versions improved the cost model for deciding between custom and generic plans:
Better accounting for partition pruning benefits in the cost calculation
More accurate statistics gathering on partitioned tables
Improved handling of partitionwise joins
PostgreSQL 15-16: The Real Game Changers
PostgreSQL 15 and 16 brought transformative improvements:
PostgreSQL 15:
Dramatically reduced memory usage for generic plans on partitioned tables
Improved execution-time pruning performance
Better handling of prepared statements with partition pruning
PostgreSQL 16:
Introduced incremental sorting improvements that benefit partitioned queries
Enhanced partition-wise aggregation
More aggressive execution-time pruning
The key breakthrough: PostgreSQL now builds “stub” plans that allocate minimal memory for partitions that will be pruned, rather than fully initializing all partition nodes.
Workarounds for Older Versions
If you’re stuck on older PostgreSQL versions, here are strategies to avoid the prepared statement pitfall:
1. Disable Generic Plans
Force PostgreSQL to always use custom plans:
-- Set at session level
SET plan_cache_mode = force_custom_plan;
-- Or for specific prepared statement contexts
PREPARE get_events AS
SELECT * FROM events WHERE event_date = $1;
-- SET LOCAL only takes effect inside a transaction block
BEGIN;
SET LOCAL plan_cache_mode = force_custom_plan;
EXECUTE get_events('2024-06-15');
COMMIT;
This sacrifices the planning time savings but ensures proper partition pruning.
2. Use Statement Level Caching Instead
Many ORMs and database drivers offer statement level caching that doesn’t persist across multiple executions:
# psycopg2 example: a plain parameterized execute() sends the query text
# each time and does not create a server-side prepared statement,
# so no generic plan is ever cached
import psycopg2

connection = psycopg2.connect("dbname=mydb")  # connection details are illustrative
cursor = connection.cursor()
date_value = '2024-06-15'
cursor.execute(
    "SELECT * FROM events WHERE event_date = %s",
    (date_value,)
)
3. Adjust plan_cache_mode Per Query
PostgreSQL 12+ provides plan_cache_mode:
-- auto (default): use PostgreSQL's heuristics
-- force_generic_plan: always use generic plan
-- force_custom_plan: always use custom plan
SET plan_cache_mode = force_custom_plan;
For partitioned tables, force_custom_plan is often the right choice.
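Because plan_cache_mode is an ordinary configuration parameter, you can also scope it to a role or database rather than setting it in every session; a sketch, using a hypothetical app_user role and analytics database:
ALTER ROLE app_user SET plan_cache_mode = force_custom_plan;
-- or for every connection to one database:
ALTER DATABASE analytics SET plan_cache_mode = force_custom_plan;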
4. Increase Custom Plan Count
The threshold of 5 custom plans before switching to generic is hardcoded, but you can work around it by using different prepared statement names or by periodically deallocating and recreating prepared statements:
DEALLOCATE get_events;
PREPARE get_events AS SELECT * FROM events WHERE event_date = $1;
5. Partition Pruning Hints
In PostgreSQL 12+, you can sometimes coerce the planner into better behavior:
-- Using an explicit constraint that helps the planner
SELECT * FROM events
WHERE event_date = $1
AND event_date >= CURRENT_DATE - INTERVAL '1 year';
This additional constraint provides a hint about the parameter range.
Best Practices
Monitor your query plans: Use EXPLAIN (ANALYZE, BUFFERS) to check whether partition pruning is happening, as shown in section 3.5.
Check prepared statement statistics: Query pg_prepared_statements to see generic vs. custom plan usage (the generic_plans and custom_plans counters are available from PostgreSQL 14):
SELECT name,
generic_plans,
custom_plans
FROM pg_prepared_statements;
Upgrade PostgreSQL: If you’re dealing with large partitioned tables, the improvements in PostgreSQL 15+ are worth the upgrade effort.
Design partitions appropriately: Don’t over-partition. Having 10,000 tiny partitions creates problems even with perfect pruning.
Use connection pooling wisely: Prepared statements persist per connection. With connection pooling, long-lived connections accumulate many prepared statements. Configure your pooler to occasionally recycle connections.
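If you use PgBouncer, for example, the server_lifetime setting bounds how long a server connection, and therefore its accumulated prepared statements, can live; the value below is only a placeholder:
; pgbouncer.ini: recycle server connections periodically so cached
; prepared statements do not accumulate forever
server_lifetime = 3600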
Benchmark both modes: Test your specific workload with both custom and generic plans to measure the actual impact.
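A crude way to compare the two modes in psql, assuming the get_metrics statement from earlier, is to force each mode in turn and compare the ANALYZE timings:
SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE) EXECUTE get_metrics('2024-06-01', '2024-06-02');
SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE) EXECUTE get_metrics('2024-06-01', '2024-06-02');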
Conclusion
Prepared statements are a powerful optimization, but their interaction with partitioned tables exposes a fundamental tension: caching for reuse versus specificity for efficiency. PostgreSQL’s evolution from version 11 through 16 represents a masterclass in addressing this challenge.
The key takeaway: if you’re using prepared statements with partitioned tables on PostgreSQL versions older than 15, be vigilant about plan caching behavior. Monitor memory usage, check execution plans, and don’t hesitate to force custom plans when generic plans cause problems.
For modern PostgreSQL installations (15+), the improvements are substantial enough that the traditional guidance of “be careful with prepared statements on partitioned tables” is becoming outdated. The database now handles these scenarios with far more intelligence and efficiency.
But understanding the history and mechanics remains crucial, because the next time you see mysterious memory growth in your PostgreSQL connections, you’ll know exactly where to look.
If you have tier 1 services that are dependent on a few DNS records, then you may want a simple batch job to monitor these DNS records for changes or deletion.
The script below contains an example list of DNS entries (replace these records with the ones you want to monitor).
@echo off
setlocal enabledelayedexpansion
REM ============================================================================
REM DNS Monitor Script for Windows Server
REM Purpose: Monitor DNS entries for changes every 15 minutes
REM Author: Andrew Baker
REM Version: 1.0
REM Date: August 13, 2018
REM ============================================================================
REM Configuration Variables
set "LOG_FILE=dns_monitor.log"
set "PREVIOUS_FILE=dns_previous.tmp"
set "CURRENT_FILE=dns_current.tmp"
set "CHECK_INTERVAL=900"
REM DNS Entries to Monitor (Comma Separated List)
REM Add or modify domains as needed
set "DNS_LIST=google.com,microsoft.com,github.com,stackoverflow.com,amazon.com,facebook.com,twitter.com,linkedin.com,youtube.com,cloudflare.com"
REM Initialize log file with header if it doesn't exist
if not exist "%LOG_FILE%" (
echo DNS Monitor Log - Started on %DATE% %TIME% > "%LOG_FILE%"
echo ============================================================================ >> "%LOG_FILE%"
echo. >> "%LOG_FILE%"
)
:MAIN_LOOP
echo [%DATE% %TIME%] Starting DNS monitoring cycle...
echo [%DATE% %TIME%] INFO: Starting DNS monitoring cycle >> "%LOG_FILE%"
REM Clear current results file
if exist "%CURRENT_FILE%" del "%CURRENT_FILE%"
REM Process each DNS entry
for %%d in (%DNS_LIST%) do (
call :CHECK_DNS "%%d"
)
REM Compare with previous results if they exist
if exist "%PREVIOUS_FILE%" (
call :COMPARE_RESULTS
) else (
echo [%DATE% %TIME%] INFO: First run - establishing baseline >> "%LOG_FILE%"
)
REM Copy current results to previous for next comparison
copy "%CURRENT_FILE%" "%PREVIOUS_FILE%" >nul 2>&1
echo [%DATE% %TIME%] DNS monitoring cycle completed. Next check in 15 minutes...
echo [%DATE% %TIME%] INFO: DNS monitoring cycle completed >> "%LOG_FILE%"
echo. >> "%LOG_FILE%"
REM Wait 15 minutes (900 seconds) before next check
timeout /t %CHECK_INTERVAL% /nobreak >nul
goto MAIN_LOOP
REM ============================================================================
REM Function: CHECK_DNS
REM Purpose: Resolve DNS entry and log results
REM Parameter: %1 = Domain name to check
REM ============================================================================
:CHECK_DNS
set "DOMAIN=%~1"
echo Checking DNS for: %DOMAIN%
REM Perform nslookup and capture results
nslookup "%DOMAIN%" > temp_dns.txt 2>&1
REM Check if nslookup was successful
if %ERRORLEVEL% equ 0 (
REM Extract resolved IP addresses from nslookup output
REM Note: on Windows the first "Address:" line is the DNS server itself,
REM so it is skipped; multi-address answers ("Addresses:") may need extra parsing
set "IP_ADDRESS="
for /f "skip=1 tokens=2" %%i in ('findstr /c:"Address:" temp_dns.txt') do (
set "IP_ADDRESS=%%i"
echo %DOMAIN%,!IP_ADDRESS! >> "%CURRENT_FILE%"
echo [%DATE% %TIME%] INFO: %DOMAIN% resolves to !IP_ADDRESS! >> "%LOG_FILE%"
)
REM Handle case where no IP addresses were found in a successful lookup
if not defined IP_ADDRESS (
echo %DOMAIN%,RESOLUTION_ERROR >> "%CURRENT_FILE%"
echo [%DATE% %TIME%] ERROR: %DOMAIN% - No IP addresses found in DNS response >> "%LOG_FILE%"
type temp_dns.txt >> "%LOG_FILE%"
echo. >> "%LOG_FILE%"
)
) else (
REM DNS resolution failed
echo %DOMAIN%,DNS_FAILURE >> "%CURRENT_FILE%"
echo [%DATE% %TIME%] ERROR: %DOMAIN% - DNS resolution failed >> "%LOG_FILE%"
type temp_dns.txt >> "%LOG_FILE%"
echo. >> "%LOG_FILE%"
)
REM Clean up temporary file
if exist temp_dns.txt del temp_dns.txt
goto :EOF
REM ============================================================================
REM Function: COMPARE_RESULTS
REM Purpose: Compare current DNS results with previous results
REM ============================================================================
:COMPARE_RESULTS
echo Comparing DNS results for changes...
REM Read previous results into memory
if exist "%PREVIOUS_FILE%" (
for /f "tokens=1,2 delims=," %%a in (%PREVIOUS_FILE%) do (
set "PREV_%%a=%%b"
)
)
REM Compare current results with previous
for /f "tokens=1,2 delims=," %%a in (%CURRENT_FILE%) do (
set "CURRENT_DOMAIN=%%a"
set "CURRENT_IP=%%b"
REM Get previous IP for this domain
set "PREVIOUS_IP=!PREV_%%a!"
if "!PREVIOUS_IP!"=="" (
REM New domain added
echo [%DATE% %TIME%] INFO: New domain added to monitoring: !CURRENT_DOMAIN! = !CURRENT_IP! >> "%LOG_FILE%"
) else if "!PREVIOUS_IP!" neq "!CURRENT_IP!" (
REM DNS change detected
echo [%DATE% %TIME%] WARNING: DNS change detected for !CURRENT_DOMAIN! >> "%LOG_FILE%"
echo [%DATE% %TIME%] WARNING: Previous IP: !PREVIOUS_IP! >> "%LOG_FILE%"
echo [%DATE% %TIME%] WARNING: Current IP: !CURRENT_IP! >> "%LOG_FILE%"
echo [%DATE% %TIME%] WARNING: *** INVESTIGATE DNS CHANGE *** >> "%LOG_FILE%"
echo. >> "%LOG_FILE%"
REM Also display warning on console
echo.
echo *** WARNING: DNS CHANGE DETECTED ***
echo Domain: !CURRENT_DOMAIN!
echo Previous: !PREVIOUS_IP!
echo Current: !CURRENT_IP!
echo Check log file for details: %LOG_FILE%
echo.
)
)
REM Check for domains that disappeared from current results
for /f "tokens=1,2 delims=," %%a in (%PREVIOUS_FILE%) do (
set "CHECK_DOMAIN=%%a"
set "FOUND=0"
for /f "tokens=1 delims=," %%c in (%CURRENT_FILE%) do (
if "%%c"=="!CHECK_DOMAIN!" set "FOUND=1"
)
if "!FOUND!"=="0" (
echo [%DATE% %TIME%] WARNING: Domain !CHECK_DOMAIN! no longer resolving or removed from monitoring >> "%LOG_FILE%"
)
)
goto :EOF
REM ============================================================================
REM End of Script
REM ============================================================================
If you see the error “The capture session could not be initiated on the device "en0" (You don’t have permission to capture on that device)” when trying to start a packet capture in Wireshark, you can try installing ChmodBPF; but I suspect you will need to follow the steps below:
$ whoami
superman
$ cd /dev
/dev $ sudo chown superman:admin bp*
Password:
$ ls -la | grep bp
crw------- 1 cp363412 admin 0x17000000 Jan 13 21:48 bpf0
crw------- 1 cp363412 admin 0x17000001 Jan 14 09:56 bpf1
crw------- 1 cp363412 admin 0x17000002 Jan 13 20:57 bpf2
crw------- 1 cp363412 admin 0x17000003 Jan 13 20:57 bpf3
crw------- 1 cp363412 admin 0x17000004 Jan 13 20:57 bpf4
/dev $
If you want to automatically renew your certs, then the easiest way is to set up a cron job that calls letsencrypt periodically. Below is an example.
First, create a bash script to renew the certificate.
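What follows is a minimal sketch, assuming the certbot client and an nginx front end; the paths, schedule, and reload command are placeholders you should adapt:
#!/bin/bash
# /usr/local/bin/renew-certs.sh (illustrative path)
# Renew any certificates that are close to expiry, then reload the
# web server so it picks up the new files.
/usr/bin/certbot renew --quiet --post-hook "systemctl reload nginx"
Then schedule it from root's crontab (crontab -e), for example twice a day:
17 3,15 * * * /usr/local/bin/renew-certs.sh >> /var/log/cert-renew.log 2>&1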
Getting an application knocked out by a simple SYN flood is both embarrassing and avoidable. It’s also very easy to create a SYN flood, so it’s something you should design against. Below is the hping3 command line that I use to test my services against SYN floods. I have used quite a few modifications to make the test a bit more realistic, but you can also distribute this across a few machines to stretch the target host a bit more if you want to.
Parameters:
-c --count: Stop after sending (and receiving) count response packets. After the last packet is sent, hping3 waits COUNTREACHED_TIMEOUT seconds for target host replies. You can tune COUNTREACHED_TIMEOUT by editing hping3.h.
-d --data: Set packet body size. Warning: using --data 40, hping3 will not generate 0 byte packets but protocol_header+40 bytes. hping3 displays packet size information as the first line of output, like this: HPING www.yahoo.com (ppp0 204.71.200.67): NO FLAGS are set, 40 headers + 40 data bytes
-S --syn: Set the SYN tcp flag.
-w --win: Set the TCP window size. Default is 64.
-p --destport: Set the destination port; the default is 0. If a ‘+’ character precedes the port number (e.g. +1024), the destination port is increased for each reply received. If a double ‘++’ precedes the port number (e.g. ++1024), the destination port is increased for each packet sent. By default the destination port can be modified interactively using CTRL+z.
--flood: Send packets as fast as possible, without waiting for incoming replies. This is faster than the -i u0 option.
--rand-source: Enable random source mode. hping will send packets with a random source address. This is useful to stress firewall state tables and other per-ip dynamic tables inside TCP/IP stacks and firewall software.
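Putting those flags together, a representative invocation looks something like the line below. The target, port, and payload size are placeholders, and --flood keeps sending until you interrupt it (drop --flood and add a count such as -c 10000 for a bounded run):
hping3 -S -p 443 -d 120 -w 64 --flood --rand-source target.example.com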
Today I am a happy bunny!!!! Yury Tsarev (a very clever dude) did a presentation to Tim Hockin, one of the Kubernetes co-founders. The demo covered one of Absa bank’s open source projects called K8GB (a cloud native GSLB for K8s): https://www.k8gb.io/
Why do I like K8GB? Because it uses a single CRD that integrates with all the big DNS providers (like NS1, Infoblox and Route 53), and it uses zone delegation to allow developers to manage their GTM in a single CRD. It also has native health checks using the Kubernetes liveness and readiness probes (rather than just ICMP or HTTP responses). Put simply, it saves a whole bunch of unnecessary yak shaving!
So, if you have a few minutes, you can watch the video (the demo starts after 10 minutes or so).