ShedLock and Payments: Every Way It Can Fail, How to Retry Safely, and Every Lock Type Worth Knowing
ShedLock prevents duplicate execution of scheduled tasks across multiple application instances by storing a distributed lock in a shared data store such as a database, Redis, or ZooKeeper. When a scheduled method fires, ShedLock attempts to insert or update a lock record; only the node that successfully acquires that lock proceeds, while all others skip execution entirely for that interval.
1. What ShedLock Is and How It Works
Spring Boot makes it trivial to schedule a task. You add @EnableScheduling to a configuration class, annotate a method with @Scheduled, and the framework fires it on your chosen cron or interval. The problem surfaces the moment you deploy more than one instance of your application. In a Kubernetes cluster of ten pods, every pod fires every scheduled method at the same time. For a report generation job that runs slightly redundant work, this is tolerable. For a payment batch job that submits instructions to a processor, it is a direct path to charging customers twice.
ShedLock, created by Lukas Krecan, solves exactly this problem. It is a lightweight Java library that ensures a scheduled task executes on only one node at a time in a distributed environment. It does this by writing a lock record to an external store before the task runs. Every other node that fires at the same time reads that record, sees the lock is held, and skips execution silently. When the task finishes, the lock is released immediately. If the holding node crashes before it can release the lock, the record expires automatically after a configured maximum duration and the next scheduled run can proceed normally on whichever node wins the lock next.
The external store can be a relational database via JDBC, MongoDB, Redis, DynamoDB, Etcd, Hazelcast, or ZooKeeper. For most backend services the JDBC provider backed by the application’s primary database is the right choice, for reasons that become clear in the failure modes section.
The lock record itself is a single table with four columns: a name that acts as the primary key, a lock_until timestamp, a locked_at timestamp, and a locked_by hostname. ShedLock requires this table to exist before the application starts.
CREATE TABLE shedlock (
    name       VARCHAR(64)  NOT NULL,
    lock_until TIMESTAMP(3) NOT NULL,
    locked_at  TIMESTAMP(3) NOT NULL,
    locked_by  VARCHAR(255) NOT NULL,
    CONSTRAINT pk_shedlock PRIMARY KEY (name)
);
Configuration in Spring Boot is minimal. You enable the feature with @EnableSchedulerLock, provide a LockProvider bean, and annotate each scheduled method with @SchedulerLock.
@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
public class SchedulingConfig {
    @Bean
    public LockProvider lockProvider(DataSource dataSource) {
        return new JdbcTemplateLockProvider(
            JdbcTemplateLockProvider.Configuration.builder()
                .withJdbcTemplate(new JdbcTemplate(dataSource))
                .usingDbTime()
                .build()
        );
    }
}

@Scheduled(cron = "0 */15 * * * *")
@SchedulerLock(
    name = "paymentBatchJob",
    lockAtLeastFor = "PT13M",
    lockAtMostFor = "PT45M"
)
public void runPaymentBatch() {
    // only one node executes this at a time
}
The two duration parameters on @SchedulerLock are the most important configuration decisions you will make. lockAtMostFor is the safety net: if the node holding the lock crashes or hangs, the lock expires after this duration so other nodes can proceed. The ShedLock README is explicit that this value should be significantly larger than the maximum estimated execution time, because if the task takes longer than lockAtMostFor the results are unpredictable and more than one process will effectively hold the lock simultaneously. lockAtLeastFor is the minimum hold time, which prevents rapid re-execution on a different node when a task completes very quickly and clock drift between nodes is larger than the task duration.
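The interaction of the two durations can be modelled in a few lines. The sketch below is an in-memory stand-in for the lock row, not ShedLock's actual implementation; the class and method names (LockTableSketch, tryLock, unlock) are hypothetical and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** In-memory model of the lock row semantics; an illustration only. */
final class LockTableSketch {
    // name -> lock_until and name -> locked_at, standing in for the shedlock table.
    private final Map<String, Instant> lockUntil = new HashMap<>();
    private final Map<String, Instant> lockedAt = new HashMap<>();

    /** Acquire succeeds only if no row exists or the existing lease has expired. */
    synchronized boolean tryLock(String name, Instant now, Duration lockAtMostFor) {
        Instant current = lockUntil.get(name);
        if (current != null && current.isAfter(now)) {
            return false; // another node still holds the lease
        }
        lockedAt.put(name, now);
        lockUntil.put(name, now.plus(lockAtMostFor));
        return true;
    }

    /** Release keeps the row locked until lockAtLeastFor has elapsed since acquisition. */
    synchronized void unlock(String name, Instant now, Duration lockAtLeastFor) {
        Instant earliestRelease = lockedAt.get(name).plus(lockAtLeastFor);
        lockUntil.put(name, now.isAfter(earliestRelease) ? now : earliestRelease);
    }
}
```

With lockAtLeastFor set to thirteen minutes, a job that finishes in forty seconds still blocks acquisition until the thirteen-minute mark, which is what stops a second node with a slightly different clock from running the same fifteen-minute slot again.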
The usingDbTime() call on the JDBC provider is not optional for production use. Without it, ShedLock uses application node clocks for timestamp comparisons. With it, all timestamp operations are governed by the database clock, which eliminates clock skew between nodes as a failure mode entirely. It also means the lock mechanism fails closed when the database is unavailable: no timestamps can be evaluated, no lock can be acquired, and no execution proceeds.
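The clock skew problem is simple arithmetic, and a tiny sketch may make it concrete. The names below (ClockSkewSketch, appearsExpired) are hypothetical; the method only models how a node whose clock runs ahead evaluates a lock_until value written by another node.

```java
import java.time.Duration;
import java.time.Instant;

final class ClockSkewSketch {
    /**
     * Returns true if a node whose clock is offset by 'skew' from true time
     * would consider the lease expired.
     */
    static boolean appearsExpired(Instant lockUntil, Instant trueNow, Duration skew) {
        Instant observerNow = trueNow.plus(skew);
        return !lockUntil.isAfter(observerNow);
    }
}
```

A lease with ten seconds left looks live to a node with an accurate clock and expired to a node running fifteen seconds fast. Evaluating every comparison against one clock, the database's, removes the offset from the calculation.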
ShedLock also provides two mechanisms for jobs that may legitimately run longer than lockAtMostFor can safely cover. LockExtender.extendActiveLock lets code inside the locked method extend the lease at runtime, which is the right pattern for batch jobs processing records in a loop. KeepAliveLockProvider wraps the underlying provider and extends automatically at the midpoint of each lockAtMostFor interval, which is simpler but less precise. The ShedLock author recommends KeepAliveLockProvider only in special cases because a stalled job will keep extending indefinitely rather than letting the lease expire naturally.
It is worth being clear about what ShedLock is not. The library’s own documentation states it is not and will never be a full-fledged scheduler. It does not track whether a job succeeded or failed. It does not retry failed jobs. It does not guarantee that a job runs at all, only that if it runs, it runs on one node. If you need distributed scheduling with failure recovery and guaranteed execution semantics, JobRunr or db-scheduler are designed for that problem. ShedLock is specifically a scheduling lock, and that narrowness is both its strength and the source of every failure mode described in the sections that follow.
2. Why Payments Make Everything Harder
ShedLock is genuinely safe for the majority of scheduled tasks. A job that generates a nightly report, sends a digest email, or cleans up expired sessions can tolerate duplicate execution because the worst outcome is a redundant database write or a slightly annoying second email. Payments cannot tolerate duplicate execution under any conditions because the worst outcome is a customer being charged twice, and no engineering team wants to be the one explaining that to their regulator.
The asymmetry is not about probability. Under normal conditions with a correctly configured ShedLock, duplicate execution is extremely unlikely. The asymmetry is about consequences. In payments, the failure conditions that ShedLock was not designed to handle are precisely the conditions that matter most: the node that crashes mid-batch, the lease that expires three seconds before the last payment in a chunk is dispatched, the Redis cluster that partitions at 02:00 on a Saturday morning when the overnight batch is running. These are not exotic edge cases. They are the routine failure modes of distributed systems at scale, and they happen at the worst possible times.
The engineering standard that follows from this is direct: never depend on a distributed lock as your sole correctness mechanism for payments. A lock is a coordination primitive. It manages who is allowed to act. Idempotency is a correctness primitive. It manages what the outcome will be regardless of who acted, and regardless of how many times they acted. A payment system that has only coordination is one that is safe when everything works and dangerous when anything does not. A payment system that has both is one where the failure modes are bounded, recoverable, and visible before a customer or a regulator notices them.
The distinction maps cleanly to mechanism choices.
| Concern | Category | Mechanism |
|---|---|---|
| Prevent duplicate work | Coordination | ShedLock, DB row lock |
| Detect record conflicts | Coordination | Optimistic versioning |
| Guarantee unique outcomes | Correctness | Idempotency keys |
| Recover from crashes | Correctness | Retry with state check |
| Guarantee final state | Correctness | State machine + reconciliation |
| Handle ordering | Correctness | Version predicates |
3. Real Failures: What Happens When Payment Systems Get This Wrong
The failures described in this section are not theoretical. They are documented incidents at real financial institutions, each of which resulted in customer harm, regulatory scrutiny, and in some cases permanent financial loss to the institution itself.
In March 2022, TSB customers in the UK found that payments had been taken twice from their accounts, leaving some unable to pay for food and bills. The bank confirmed duplicate payments were occurring and began issuing refunds, but the incident followed a pattern TSB had already become known for: a major IT migration failure in 2018 had affected 1.9 million customers, and an independent report found the bank’s board lacked basic common sense in how it prepared for the migration, deploying a system that had not been properly tested. The 2022 duplication incident was a separate technical error but in the same category: a failure of the payment processing layer to enforce exactly-once semantics under unexpected conditions.
NatWest suffered a similar incident when debit card purchases began appearing twice on customer accounts, causing some customers to be overdrawn and without funds. The bank confirmed the issue started several days before it became publicly visible, meaning the duplicate state had been accumulating in the system before anyone detected it. This is precisely the failure mode that an absent reconciliation pipeline produces: silent duplication that only surfaces when customers check their balances.
RBS in 2015 experienced a different but equally consequential failure: 600,000 payments failed to credit customer accounts, with wages, tax credits, and disability living allowance among the affected transactions. The chairman of the parliamentary committee overseeing the incident described the situation as unacceptable. The payments were eventually processed but the damage to customer trust and the regulatory scrutiny that followed were significant. This was a lost payment failure rather than a duplication failure, but it illustrates the same underlying principle: when payment batch processing fails, the consequences are immediate and highly visible to customers who depend on those funds.
The most financially catastrophic payment processing failure of recent years did not involve a consumer batch job at all. In August 2020, Citibank intended to wire a $7.8 million interim interest payment on behalf of Revlon to its lenders. Instead, due to human error in a 24-year-old software system called Flexcube, Citibank wired roughly $900 million, the full principal balance of the loan, to Revlon’s creditors. Some creditors returned their share, but ten refused. In February 2021, a federal district judge ruled that Citibank could not recover approximately $500 million of the mistaken transfer, citing a legal principle that the recipients neither knew nor should have known that the payment was in error. An appeals court overturned that ruling in 2022 and the dispute was then settled, roughly two years after the original transfer. Citibank CEO Jane Fraser later described the incident as a massive, unforced error. The root cause, beyond human error, was a payment software system so old and so complex that three people reviewing a wire transfer were unable to correctly interpret what would be dispatched when they confirmed the instruction. The system did not make the outcome of the operation legible before execution. It had no last-chance idempotency check, no pre-flight verification of dispatch amount against expected amount, and no structural guard against the operation being unrecoverable.
These incidents share a common thread. In every case, the payment processing layer failed to enforce exactly-once semantics under conditions that were unusual but not extraordinary. A correctly instrumented system with idempotency keys at the dispatch layer, a dispatch state table with a unique constraint, and a reconciliation pipeline would have caught the duplication or the misconfiguration before it reached customers. The Citibank case required something additional — a structural assertion that the amount being wired matched the amount authorised — but the category of failure is the same: a payment system that relied on process and manual review where it needed structural enforcement.
4. The Six Ways ShedLock Fails for Payments
4.1 Lock Duration Misconfiguration
Every @SchedulerLock annotation requires a lockAtMostFor value, and that value is almost universally set by intuition rather than measurement. Consider a payment batch job that runs every fifteen minutes and typically completes in forty seconds. You set lockAtMostFor to five minutes because that feels generous. On a day when your database is under load, a downstream processor is responding slowly, and your batch size has grown by thirty percent, the job takes six minutes and twenty seconds. At the five-minute mark, the lease expires. A second node fires the job, acquires the lease, and begins processing. The first node is still working through the same batch. Both nodes are now dispatching instructions for the same payment records.
The ShedLock README states this directly: if a task takes longer than lockAtMostFor, the resulting behaviour may be unpredictable and more than one process will effectively hold the lock. The fix requires two things: instrument your job to measure actual p99 execution time in production under realistic load, then set lockAtMostFor to at least three times that number.
@Scheduled(cron = "0 */15 * * * *")
@SchedulerLock(
    name = "paymentBatchJob",
    lockAtLeastFor = "PT13M",
    lockAtMostFor = "PT45M" // 3× measured p99 under load
)
public void runPaymentBatch() {
    long start = System.currentTimeMillis();
    try {
        processPaymentBatch();
    } finally {
        meterRegistry.timer("payment.batch.duration")
            .record(System.currentTimeMillis() - start, TimeUnit.MILLISECONDS);
    }
}
Without execution time metrics in production you are guessing at lockAtMostFor, and that guess will be wrong under exactly the conditions where it matters most.
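The sizing rule from this section (p99 multiplied by a safety factor) can be sketched as a small calculation. This is an illustration under stated assumptions: the helper names are hypothetical, the percentile uses the nearest-rank method over raw samples, and in production the p99 would more likely come from your metrics backend than from an in-process array.

```java
import java.time.Duration;
import java.util.Arrays;

final class LockDurationSizing {
    /** Nearest-rank percentile over observed run durations, in milliseconds. */
    static long percentileMillis(long[] samplesMillis, double percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** lockAtMostFor = p99 × safetyFactor (the text above suggests a factor of at least 3). */
    static Duration recommendedLockAtMostFor(long[] samplesMillis, int safetyFactor) {
        return Duration.ofMillis(percentileMillis(samplesMillis, 99.0) * safetyFactor);
    }
}
```

For one hundred runs lasting 1 to 100 seconds, the nearest-rank p99 is 99 seconds and the recommended lockAtMostFor is 297 seconds. The sample window should cover the slow days, since those are the runs that determine whether the lease holds.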
4.2 The Lease Expires While the Batch Is Still Running
Even with a correctly measured lockAtMostFor, a job can legitimately run longer than expected due to downstream slowness, connection pool exhaustion, or network partition-induced retry backoff. The right pattern is to call LockExtender.extendActiveLock at the start of each processing chunk, which heartbeats the lease for as long as the job makes progress and stops extending if the job stalls, allowing the lease to expire naturally.
public void runPaymentBatch() {
    List<List<Payment>> chunks = Lists.partition(getPending(), 100);
    for (List<Payment> chunk : chunks) {
        // Extend by 10 minutes before each chunk.
        // Throws if the lock store is unavailable, halting the job safely.
        LockExtender.extendActiveLock(Duration.ofMinutes(10), Duration.ZERO);
        processChunk(chunk);
    }
}
A stalled job with KeepAliveLockProvider extends indefinitely. A stalled job with explicit chunk-level extension stops extending and lets the lease expire, which is the correct behaviour for a payment batch that has stopped making progress.
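The difference between a job that keeps making progress and one that stalls can be shown with a simplified lease model. The class below is hypothetical and only mimics the extend-before-each-chunk pattern; it is not how LockExtender is implemented, and time is passed in explicitly rather than read from a clock.

```java
import java.time.Duration;
import java.time.Instant;

final class ChunkLeaseSketch {
    private Instant lockUntil;

    ChunkLeaseSketch(Instant acquiredAt, Duration lockAtMostFor) {
        this.lockUntil = acquiredAt.plus(lockAtMostFor);
    }

    /** Called before each chunk; refuses to extend a lease that has already lapsed. */
    void extend(Instant now, Duration extension) {
        if (!lockUntil.isAfter(now)) {
            throw new IllegalStateException("lease expired; stop processing");
        }
        lockUntil = now.plus(extension);
    }

    boolean isHeld(Instant now) {
        return lockUntil.isAfter(now);
    }
}
```

A chunk that starts at minute eight pushes the lease out to minute eighteen. If that chunk then hangs, nothing extends the lease again, it lapses at minute eighteen, and the next extension attempt fails instead of silently continuing alongside another node.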
4.3 The Lock Store Becomes Unavailable
If the lock store fails before a job acquires the lease, the job is skipped, which is the safe outcome. The more dangerous scenario is brief unavailability at exactly the scheduled execution time: multiple nodes all attempt to acquire the lease simultaneously during recovery. The usingDbTime() configuration mitigates this by using database timestamps rather than node clocks. Without it, clock skew between nodes can cause a lock that is legitimately held to appear expired to a node whose clock is ahead.
If you are using Redis as the lock store, a cluster partition can cause two nodes to independently believe they hold the same lease. This is the Redlock failure mode that Martin Kleppmann analysed in 2016, arguing that it cannot be made safe under partitions and process pauses without an additional mechanism such as fencing tokens. For payment jobs, the JDBC provider backed by your primary payment database removes this class of failure: the lock state and the payment state live in the same database, so the store that governs locking cannot partition away from the store that governs payment records. If the database is unreachable, neither the lock nor the payment write can proceed.
4.4 No Job Completion Signal
ShedLock does not know whether your job succeeded or failed. It releases the lock when the annotated method returns or throws, treating both outcomes identically. If your payment batch throws an uncaught exception after dispatching sixty percent of the records, the lock is released and the next scheduled run begins with no awareness that a partial batch was processed. Without an external record of which instructions were successfully dispatched, the retry reprocesses the entire batch including the instructions already sent.
This is not a ShedLock design flaw. It is a design boundary. The solution is the dispatch state table and idempotency key pattern described in section 5.
4.5 Idempotency Keys Derived From Runtime State
A UUID generated at dispatch time is not an idempotency key in any meaningful sense. A retry generates a different UUID, the processor treats it as a new instruction, and the payment is duplicated. The key must be derived from the business event itself so that the same payment always produces the same key regardless of when or how many times the derivation runs.
private String buildIdempotencyKey(Payment payment) {
    String raw = payment.getId()
        + "|" + payment.getAmountMinorUnits()
        + "|" + payment.getDestinationAccountRef()
        + "|" + payment.getScheduledDate().toString();
    return Hashing.sha256()
        .hashString(raw, StandardCharsets.UTF_8)
        .toString();
}
Most major payment processors including Stripe and Adyen support idempotency keys natively. When your processor receives a request with a key it has already processed, it returns the original result rather than executing again. This is the mechanism that transforms a duplicate dispatch into an idempotent operation.
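The property that matters is determinism: the same business event must always yield the same key. The sketch below is a dependency-free variant of the derivation above that uses only the JDK (the snippet above relies on Guava's Hashing). The class name and parameter list are assumptions chosen to mirror the fields used in this section.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;

final class IdempotencyKeys {
    /** Derives a 64-character hex key from business fields only, never from runtime state. */
    static String derive(String paymentId, long amountMinorUnits,
                         String destinationAccountRef, LocalDate scheduledDate) {
        String raw = paymentId + "|" + amountMinorUnits + "|"
                + destinationAccountRef + "|" + scheduledDate;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Calling derive twice with the same arguments returns the same string, on any node and on any retry, whereas UUID.randomUUID() would differ on every call. Changing any field, such as the amount, produces a different key, so a corrected payment is not mistaken for the original.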
4.6 No Observability on Lock Behaviour
Lock acquisition failures, extended lease durations, and lock store errors are silent unless you instrument them. In a payments context they surface through customer complaints or reconciliation breaks rather than monitoring alerts, by which point the damage is done. The minimum instrumentation set covers job execution duration as a histogram (to inform lockAtMostFor tuning), lock acquisition failures as a counter (to detect unexpected job skipping), dispatch state rows in PENDING status older than your reconciliation interval as a gauge (to detect unresolved dispatches), and reconciliation resolution rate split by outcome type (to distinguish transient from permanent failures).
5. Building the Correctness Layer
Configuring ShedLock correctly reduces the probability that the coordination layer fails. It does nothing to guarantee the outcome when it does fail. That guarantee requires a correctness layer built at the instruction level, independent of the scheduling mechanism.
The dispatch state table is the foundation. It records the status of every payment instruction individually and is the authoritative record of what has and has not been sent. The unique constraint on idempotency_key is the structural enforcement: even if two pods simultaneously attempt to insert a dispatch record for the same payment, the database rejects one of them deterministically, without any application-level locking required.
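Before the schema, a small in-memory model may make the claim-once behaviour concrete. In this sketch ConcurrentHashMap.putIfAbsent stands in for the database's unique constraint; the class and method names are hypothetical, and a real implementation would rely on the database rather than on process memory.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

final class DispatchClaimSketch {
    enum Status { PENDING, DISPATCHED, CONFIRMED }

    // idempotency key -> status; putIfAbsent plays the role of the UNIQUE constraint.
    private final ConcurrentMap<String, Status> rows = new ConcurrentHashMap<>();

    /** Returns true only for the single caller that created the row. */
    boolean claim(String idempotencyKey) {
        return rows.putIfAbsent(idempotencyKey, Status.PENDING) == null;
    }

    /** Races several threads for the same key and counts how many won the claim. */
    static int winnersAmong(int contenders, String key) {
        DispatchClaimSketch table = new DispatchClaimSketch();
        AtomicInteger winners = new AtomicInteger();
        Thread[] threads = new Thread[contenders];
        for (int i = 0; i < contenders; i++) {
            threads[i] = new Thread(() -> {
                if (table.claim(key)) {
                    winners.incrementAndGet();
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return winners.get();
    }
}
```

However many threads race for the same key, exactly one sees a successful claim. The point of the design is that the guarantee comes from the data structure (or, in production, the constraint) rather than from any lock the callers are expected to hold.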
CREATE TABLE payment_dispatch_state (
    payment_id      UUID         NOT NULL,
    idempotency_key VARCHAR(64)  NOT NULL,
    status          VARCHAR(20)  NOT NULL DEFAULT 'PENDING',
    dispatched_at   TIMESTAMP,
    confirmed_at    TIMESTAMP,
    processor_ref   VARCHAR(128),
    batch_run_id    UUID,
    created_at      TIMESTAMP    NOT NULL DEFAULT NOW(),
    CONSTRAINT pk_pds PRIMARY KEY (payment_id),
    CONSTRAINT uq_pds_idem UNIQUE (idempotency_key)
);
The dispatch method checks this table before every submission. When a DuplicateKeyException fires on insert, the record exists, and the method must not dispatch again. Whether the previous dispatch succeeded is a question for the processor to answer via the idempotency key, not for the application to guess.
// Deliberately not @Transactional: the PENDING row must commit before the processor
// call, so a later failure cannot roll back the record that a dispatch was attempted.
// On PostgreSQL a unique violation also aborts the surrounding transaction, which
// would make the follow-up read in the catch block fail.
public void dispatchIfNotAlreadySent(Payment payment) {
    String key = buildIdempotencyKey(payment);
    try {
        dispatchStateRepo.insertPending(payment.getId(), key, currentBatchRunId);
    } catch (DuplicateKeyException e) {
        DispatchState existing = dispatchStateRepo.findByKey(key);
        if (existing.isDispatched() && existing.isAmbiguous()) {
            // Sent but no confirmation received. Ask the processor.
            ProcessorStatus status = processorClient.getStatus(key);
            if (status.isConfirmed()) {
                dispatchStateRepo.markConfirmed(key, status.getReference());
            }
        }
        return; // Never dispatch again if the record already exists.
    }
    ProcessorResponse response = processorClient.submit(payment.toRequest(key));
    dispatchStateRepo.markDispatched(key, response.getReference());
}
The reconciliation pipeline is the third component. It runs on its own ShedLock-protected schedule, queries the processor for canonical status of any payment in an unresolved state beyond a defined timeout, and updates the dispatch table accordingly. Permanent failures (insufficient funds, invalid account, compliance rejection) must be recorded with a terminal status so the pipeline does not keep attempting to verify records that will never succeed.
@Scheduled(cron = "0 */5 * * * *")
@SchedulerLock(name = "paymentReconciliationJob",
    lockAtMostFor = "PT8M", lockAtLeastFor = "PT4M")
public void reconcileStaleDispatches() {
    Instant staleBefore = Instant.now().minus(Duration.ofMinutes(15));
    List<DispatchState> stale = dispatchStateRepo
        .findByStatusAndCreatedBefore(PENDING, staleBefore);
    for (DispatchState ds : stale) {
        LockExtender.extendActiveLock(Duration.ofMinutes(5), Duration.ZERO);
        try {
            ProcessorStatus status = processorClient.getStatus(ds.getIdempotencyKey());
            if (status.isConfirmed()) {
                dispatchStateRepo.markConfirmed(
                    ds.getIdempotencyKey(), status.getReference());
            } else if (status.isNotFound()) {
                dispatchStateRepo.markFailed(
                    ds.getIdempotencyKey(), "NOT_FOUND_AT_PROCESSOR");
            }
        } catch (ProcessorException e) {
            log.warn("Could not determine status for {}", ds.getIdempotencyKey());
        }
    }
}
6. How to Safely Retry a Failed Batch
A failed batch is not the same as a batch that never ran. Some records were dispatched and some were not, and the retry must know which is which. The principle is that a retry must produce the same outcome as a complete successful run regardless of where the previous attempt failed. Achieving this requires the dispatch state table, the deterministic idempotency keys, and a checkpoint that records how far the previous run progressed so the retry resumes without reprocessing completed work.
The checkpoint table has a compound primary key on job name and processing date. On a fresh run it has no record for the current batch and processing starts from chunk zero. On a retry it reads the last completed chunk number and resumes from there. The checkpoint is written after each chunk is fully processed, not before: if the job fails during chunk seven, the checkpoint records six and the retry picks up at chunk seven.
CREATE TABLE batch_checkpoint (
    job_name   VARCHAR(64) NOT NULL,
    batch_date DATE        NOT NULL,
    last_chunk INT         NOT NULL DEFAULT 0,
    updated_at TIMESTAMP   NOT NULL DEFAULT NOW(),
    CONSTRAINT pk_bc PRIMARY KEY (job_name, batch_date)
);
// Not @Transactional: each checkpoint write must commit on its own. Inside one
// method-wide transaction, a failure in chunk seven would roll back checkpoints one to six.
public void runPaymentBatch() {
    LocalDate today = LocalDate.now();
    // last_chunk holds the number of chunks already completed (0 when no row exists).
    int startChunk = checkpointRepo.getLastChunk("paymentBatchJob", today);
    // Chunk boundaries must be identical on every run for this date (same ordering,
    // same size), otherwise a saved chunk number points at different payments.
    List<List<Payment>> chunks = getChunkedPending();
    for (int i = startChunk; i < chunks.size(); i++) {
        LockExtender.extendActiveLock(Duration.ofMinutes(10), Duration.ZERO);
        processChunk(chunks.get(i));
        // Written only after the chunk completes; i + 1 is the next chunk to run.
        checkpointRepo.save("paymentBatchJob", today, i + 1);
    }
    checkpointRepo.delete("paymentBatchJob", today);
}
The interaction between retry timing and the scheduling interval is easy to overlook. If your batch runs every fifteen minutes and a failure triggers an immediate retry under a different ShedLock job name, the retry can collide with the next scheduled run. The safest model is to let the next scheduled run serve as the retry, relying on the dispatch state table and checkpoint to make it resume-safe. This keeps all payment processing under one ShedLock lock name and avoids a second coordination pathway with its own failure modes.
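The resume behaviour described in this section can be exercised without a database. The sketch below is a simplified model with hypothetical names: a counter stands in for the checkpoint row, a list records which chunks were processed, and a parameter injects a crash at a chosen chunk.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointResumeSketch {
    // Stands in for batch_checkpoint.last_chunk: number of chunks fully completed.
    private int completedChunks = 0;
    final List<Integer> processed = new ArrayList<>();

    /** Processes chunks from the checkpoint onward; failAt < 0 means no injected failure. */
    void run(int totalChunks, int failAt) {
        for (int i = completedChunks; i < totalChunks; i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated crash in chunk " + i);
            }
            processed.add(i);         // the chunk's work
            completedChunks = i + 1;  // checkpoint written only after the chunk completes
        }
    }
}
```

A first run that crashes at chunk index 6 leaves six chunks processed and the checkpoint at six. The second run starts at index 6 and finishes the batch, with every chunk processed exactly once. If the checkpoint were written before the chunk's work, the crashed chunk would be skipped on retry, which is why the write order matters.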
7. Other Lock Types and When Each Belongs
ShedLock solves the scheduled job coordination problem. A mature payment system requires additional locking mechanisms at different layers, and using ShedLock where you need Redisson, or SELECT FOR UPDATE where you need an advisory lock, produces systems that work well under normal conditions and fail in ways that are hard to diagnose under unusual ones.
Pessimistic row locking with SELECT FOR UPDATE SKIP LOCKED belongs at the work queue layer. When multiple workers process from a shared table, FOR UPDATE acquires an exclusive row lock at read time so no other worker can claim the same record. The SKIP LOCKED clause is essential: without it a second worker blocks waiting for the first worker’s locks to release, which serialises what should be parallel work. This pattern underlies many production job queue libraries, including Oban for Elixir and Solid Queue for Ruby.
SELECT id, amount, destination_account
FROM payments
WHERE status = 'PENDING'
  AND scheduled_date = CURRENT_DATE
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
Optimistic locking belongs at the payment record layer. Rather than preventing concurrent access upfront, it checks at write time whether the record has changed since it was read. A zero-row result on an update with a version predicate means another transaction got there first, at which point the calling code skips rather than retries. Under high conflict rates the retry overhead becomes significant, but for payment records claimed by exactly one batch run, conflict rates are low enough that optimistic locking is entirely appropriate.
int updated = jdbcTemplate.update(
    "UPDATE payments SET status='PROCESSING', version=version+1 " +
    "WHERE id=? AND version=? AND status='PENDING'",
    payment.getId(), payment.getVersion()
);
if (updated == 0) { return; } // already claimed, skip
Redisson belongs at the API request layer, where you need to prevent two concurrent HTTP requests from initiating a payment on behalf of the same customer. Unlike ShedLock, Redisson is designed for arbitrary code blocks rather than scheduled methods, and its watchdog mechanism automatically renews the lease for as long as the Redisson client that holds the lock is alive. Always provide an explicit lease time to tryLock, which disables the watchdog and gives a hard upper bound on hold duration. When no lease time is provided the watchdog keeps renewing while that client stays up, so a hung thread can hold the lock indefinitely, which is dangerous for payment operations.
RLock lock = redissonClient.getLock("payment:customer:" + customerId);
boolean acquired = lock.tryLock(5, 30, TimeUnit.SECONDS);
if (!acquired) {
    throw new PaymentConflictException(
        "Another payment is in progress for customer " + customerId);
}
try {
    initiatePayment(customerId, request);
} finally {
    if (lock.isHeldByCurrentThread()) lock.unlock();
}
PostgreSQL advisory locks belong at the cooperative fan-out layer. Transaction-level advisory locks release automatically on commit, rollback, or crash, making them safer than session-level locks. Each worker calls pg_try_advisory_xact_lock in non-blocking mode and moves to the next item if the lock is unavailable. Keys are global integers, so the two-argument form with a namespace identifier as the first argument is essential to avoid collisions.
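A minimal JDBC sketch of the two-argument form follows. The namespace constant, the class name, and the use of a numeric row id are assumptions made for illustration; pg_try_advisory_xact_lock(int, int) itself is a standard PostgreSQL function. The connection is expected to be inside an open transaction, since the lock is released when that transaction ends.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class AdvisoryLockSketch {
    // Hypothetical namespace identifier reserved for payment fan-out work.
    static final int PAYMENT_NAMESPACE = 4201;
    static final String TRY_LOCK_SQL = "SELECT pg_try_advisory_xact_lock(?, ?)";

    /** Folds a 64-bit row id into the 32-bit second key; distinct ids can collide. */
    static int itemKey(long paymentRowId) {
        return Long.hashCode(paymentRowId);
    }

    /** Non-blocking claim; call with auto-commit off so the lock lasts for the transaction. */
    static boolean tryClaim(Connection conn, long paymentRowId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(TRY_LOCK_SQL)) {
            ps.setInt(1, PAYMENT_NAMESPACE);
            ps.setInt(2, itemKey(paymentRowId));
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getBoolean(1);
            }
        }
    }
}
```

A worker that gets false from tryClaim simply moves on to the next item. Because the second key is only 32 bits wide, two different ids can occasionally map to the same lock; the effect is that one item is skipped for a cycle, not that two workers process the same item.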
Spring Integration’s JdbcLockRegistry belongs when the lock must participate in the same database transaction as the payment state change it protects. It can lock arbitrary named resources rather than only scheduled methods, making it appropriate for the API-layer locking scenarios ShedLock was not designed to handle.
Each mechanism belongs to a specific layer.
| Layer | Mechanism | Use Case |
|---|---|---|
| Scheduled job | ShedLock | One node runs the batch |
| Work queue | SELECT FOR UPDATE SKIP LOCKED | Multiple workers, no row overlap |
| Payment record | Optimistic versioning | Stale state transition prevention |
| API request | Redisson / JdbcLockRegistry | Duplicate initiation prevention |
| Cooperative fan-out | PostgreSQL advisory locks | Non-blocking queue distribution |
| Coordination removed | Idempotency keys + append-only ledger | No locking needed |
8. What I Actually Recommend
The practical recommendation for a payment system is a layered stack in a specific priority order. First, deterministic idempotency keys sent with every processor call: without this, everything else is probabilistic. Second, a dispatch state table with a unique constraint, which enforces idempotency at your own database level before the call reaches the processor. Third, optimistic versioning on payment records to prevent stale state transitions. Fourth, ShedLock with usingDbTime(), lease durations measured from production p99 data, and chunk-level heartbeating via LockExtender. Fifth, a reconciliation pipeline that queries the processor for any payment in an unresolved state. Sixth, a human intervention path: an operational runbook and tooling for payments the reconciliation pipeline cannot categorise, because there will always be payments that require judgement rather than automation.
Never depend on a distributed lock as the sole correctness mechanism. The Citibank Revlon incident left roughly half a billion dollars tied up in litigation for two years. The TSB and NatWest duplication incidents required emergency remediation and regulatory attention. In every case the system had controls and those controls were insufficient because they relied on coordination rather than structural correctness enforcement. A dispatch state table with a unique constraint on an idempotency key is not a complex piece of engineering. It is a schema addition and a hash function. What it provides is the ability to say with confidence that the architecture cannot produce a duplicate payment, not merely that it is unlikely to.
9. Decision Matrix
| Scenario | Primary Mechanism | Supporting Mechanism |
|---|---|---|
| Single DB, multiple threads | SELECT FOR UPDATE SKIP LOCKED | Optimistic versioning |
| Scheduled jobs, multiple nodes | ShedLock with usingDbTime() | Chunk-level LockExtender heartbeat |
| API-level payment initiation | Redisson or JdbcLockRegistry | Idempotency keys |
| Payment dispatch | Idempotency keys | Dispatch state table |
| Batch retry after failure | Checkpoint table | Dispatch state table |
| Cross-region or high partition risk | Event-driven, append-only | Idempotency keys |
| Unknown or unresolvable state | Reconciliation pipeline | Human intervention path |
References
References are ordered by relevance to the article’s core subject matter.
ShedLock
ShedLock GitHub repository — Lukas Krecan — primary documentation covering lockAtMostFor, lockAtLeastFor, LockExtender, KeepAliveLockProvider, and all supported lock providers.
Lock @Scheduled Tasks With ShedLock and Spring Boot — rieckpil.de — comprehensive practical guide to ShedLock configuration and the shedlock table lifecycle.
Distributed Task Synchronization: Leveraging ShedLock in Spring — DZone — walkthrough of the shedlock table state during task execution with database output examples.
ShedLock in Spring Scheduler: Prevent Duplicate Execution — Medium — multi-instance setup demonstration with parallel port testing.
Real Payment Failures
Citibank accidentally paid Revlon lenders $900 million — Banking Dive — original reporting on the Flexcube software error and human error chain that caused the misdirected wire.
Judge rules Citibank cannot recover $500 million — Berkeley Law — analysis of the discharge-for-value ruling and why the funds could not be recovered.
Citi settles with Revlon lenders — Banking Dive — final resolution of the two-year litigation and Jane Fraser’s description of the incident as a massive unforced error.
Lessons from Citi’s Revlon Error — University of Maryland Smith School — academic analysis of the operational and software control failures.
TSB customers double-charged due to technical error — MoneySavingExpert — March 2022 duplicate payment incident and TSB’s customer impact statement.
TSB IT failures hit 1.9 million customers — BBC News — the independent report’s findings on TSB’s 2018 migration and the board’s lack of basic common sense in system testing.
NatWest mistakenly charges debit card transactions twice — Vixio — NatWest duplicate charge incident, customer overdraft impact, and the multi-day detection gap.
RBS payment failure could last days — BBC News — 600,000 delayed payments including wages and disability allowance, and the parliamentary committee’s response.
Distributed Locking
How to do distributed locking — Martin Kleppmann — definitive analysis of why Redis-based Redlock cannot provide safety guarantees under network partitions.
ShedLock vs Redisson: when to use each — Rohit Malhotra, Medium — practical comparison of ShedLock for scheduled jobs versus Redisson for API-layer locking.
Redisson distributed locks in Java — OneUptime — RLock, tryLock, watchdog behaviour, and when to disable the watchdog with an explicit lease time.
Distributed locks in Spring Boot: ShedLock, Redisson, Spring Integration — Developer Playground — comparison of all three Spring Boot locking approaches with implementation examples.
PostgreSQL advisory locks — Krsingh, Medium — transaction-level advisory lock patterns for fan-out queue scenarios.
FOR UPDATE SKIP LOCKED for queue workflows — Netdata — comparison of advisory locks versus SKIP LOCKED for PostgreSQL queue processing.
PostgreSQL SKIP LOCKED: the job queue pattern — DB Pro — atomic worker claim mechanics and why SKIP LOCKED underlies most production job queue libraries.
Idempotency and Retry
Stripe idempotency keys — Stripe API documentation — native processor-level idempotency key support and deduplication behaviour.
Idempotency best practices — AWS Durable Execution SDK — distinction between exactly-once publishing and exactly-once outcomes, and key generation inside durable steps.
Idempotency and retry safety in distributed workflows — Orkes — idempotency store pattern and the Citibank Revlon incident as a distributed systems correctness failure.
Batch processing retry strategies — OneUptime — checkpoint patterns, dead-letter handling, and why permanent failures should route to separate alert channels.
Optimistic versus pessimistic locking — USAVPS — trade-off analysis between conflict rate, throughput, and retry overhead.