Real-World Distributed Transactions & Saga Pattern Scenarios
Real-World Distributed Transactions & Saga Pattern Scenarios (Java Edition)
This is a step-by-step guide that teaches distributed transactions and Sagas the way a senior architect would explain to a new engineer. We keep the language simple and show concrete Java code for every idea.
1️⃣ Payment succeeded, inventory failed — how to keep data consistent?
Problem (plain words)
Order → Payment → Inventory. Payment captured successfully. Inventory reservation failed. We must not leave the system in a half-done state.
Why Saga, not a global transaction?
- Global (XA/2PC) locks are fragile and slow in distributed systems.
- Sagas use local transactions per service and “undo” steps (compensations) if something downstream fails.
How it works (orchestrated Saga)
- Create order as PENDING.
- Capture payment.
- Reserve inventory.
- If inventory fails, refund payment and cancel order.
Orchestrator state (mental model)
PENDING → PAID → RESERVED → CONFIRMED ↘ (if fail) REFUNDING → CANCELLED
Java code — Order Orchestrator (Spring Boot style)
@Service @RequiredArgsConstructor public class OrderSagaOrchestrator { private final PaymentClient paymentClient; private final InventoryClient inventoryClient; private final OrderRepository orderRepo; private final SagaStepLogRepository stepLogRepo; @Transactional public void startSaga(UUID orderId) { Order order = orderRepo.findById(orderId).orElseThrow(); logStep(orderId, "CreateOrder", "DONE"); try { paymentClient.capture(orderId, order.getAmount()); logStep(orderId, "ProcessPayment", "DONE"); inventoryClient.reserve(orderId, order.getItems()); logStep(orderId, "ReserveInventory", "DONE"); order.markConfirmed(); orderRepo.save(order); } catch (Exception ex) { compensate(order); } } @Transactional protected void compensate(Order order) { UUID orderId = order.getId(); if (wasDone(orderId, "ProcessPayment") && !wasDone(orderId, "RefundPayment")) { paymentClient.refund(orderId); logStep(orderId, "RefundPayment", "COMPENSATED"); } order.markCancelled("Inventory failed"); orderRepo.save(order); } private boolean wasDone(UUID orderId, String step) { return stepLogRepo.existsByOrderIdAndStepAndStatus(orderId, step, "DONE"); } private void logStep(UUID orderId, String step, String status) { stepLogRepo.save(new SagaStepLog(orderId, step, status)); } }
Saga step log table (tracks progress)
CREATE TABLE saga_step_log ( id UUID PRIMARY KEY, order_id UUID NOT NULL, step TEXT NOT NULL, status TEXT NOT NULL, created_at TIMESTAMP DEFAULT now() );
2️⃣ Kafka event published twice — consumers acted twice. What now?
Problem
At-least-once delivery can produce duplicates. We must make the consumer idempotent.
Approach
- Each message has a stable
eventId(UUID). - Consumer stores processed IDs in a table (or cache) and skips duplicates.
Java code — Idempotent Consumer (Spring Kafka)
@Component @RequiredArgsConstructor public class InventoryConsumer { private final ProcessedEventRepository processedRepo; private final InventoryService inventoryService; @KafkaListener(topics = "payment-events", groupId = "inventory") public void onPaymentCaptured(PaymentCapturedEvent evt) { if (processedRepo.existsByEventId(evt.getEventId())) return; // duplicate try { inventoryService.reserve(evt.getOrderId(), evt.getItems()); processedRepo.save(new ProcessedEvent(evt.getEventId())); } catch (Exception e) { throw e; // let retry/DLQ handle } } }
Transactional Outbox (producer side)
Write DB change and outbox row in the same transaction. A CDC tool publishes to Kafka.
@Entity @Table(name = "outbox") public class OutboxEvent { @Id UUID eventId; String aggregateId; // orderId or paymentId String type; // e.g., PaymentCaptured @Lob String payload; // JSON Instant createdAt; }
3️⃣ Three services, one business transaction — 2PC or Saga?
- Prefer Saga. 2PC hurts availability and performance.
- Use orchestration if you need clear control and monitoring.
- Use choreography if steps are simple and loosely coupled (each service reacts to events).
Java code — Choreography example (events only)
// Payment publishes PaymentCaptured → Inventory listens and reserves → Order listens and confirms
4️⃣ Rollback step times out — how to reach a consistent end state?
Strategy
- Retry with backoff and jitter.
- After N failures, send to DLQ.
- Keep Saga step status so operators can replay compensations from DLQ safely.
Java code — Retry and DLQ (Spring Kafka)
@Bean public ConcurrentKafkaListenerContainerFactory<String, String> kafkaFactory( ConsumerFactory<String, String> cf, KafkaTemplate<String, String> template) { var factory = new ConcurrentKafkaListenerContainerFactory<String, String>(); factory.setConsumerFactory(cf); factory.setCommonErrorHandler(new DefaultErrorHandler( new DeadLetterPublishingRecoverer(template), new ExponentialBackOffWithMaxRetries(3) )); return factory; }
5️⃣ Partial failure made data inconsistent — how to detect and fix?
Detect
- Nightly job that compares authoritative sources (e.g., Orders vs Payments vs Inventory).
- Audit log or event store to trace truth.
Fix
- Produce corrective events (e.g.,
OrderCancelled,RefundPayment). - For hard cases, provide a small admin tool to mark and replay compensations.
Java code — Simple reconciliation job
@Component @RequiredArgsConstructor public class ReconciliationJob { private final OrderRepository orderRepo; private final PaymentRepository paymentRepo; private final InventoryRepository inventoryRepo; @Scheduled(cron = "0 0 * * * *") public void reconcile() { var orders = orderRepo.findAllPendingOlderThan(Duration.ofHours(1)); for (Order o : orders) { boolean paid = paymentRepo.existsCaptured(o.getId()); boolean reserved = inventoryRepo.existsReserved(o.getId()); if (paid && !reserved) { // trigger compensation: publish event or call orchestrator to refund } } } }
6️⃣ Orchestrator is a single point of failure — how to mitigate?
- Run orchestrator instances behind a queue; each Saga instance is owned by one worker (sharding by
sagaId). - Persist Saga state in DB so another instance can continue if one dies.
- Support message replay; handlers must be idempotent.
Java code — Persisted Saga state (simplified)
@Entity public class SagaInstance { @Id UUID sagaId; UUID orderId; String state; // PENDING, PAID, RESERVED, CONFIRMED, CANCELLED Instant updatedAt; }
7️⃣ “Exactly once” across services — what is realistic?
Aim for “effectively once”.
- Producer: Outbox + CDC.
- Consumer: Idempotency + dedup table.
- Optional: Store consumed offset with your write in one local DB transaction.
Java code — Store offset with business write
@Transactional public void handle(Event evt) { if (offsetRepo.exists(evt.getPartition(), evt.getOffset())) return; businessWrite(evt); offsetRepo.save(new ConsumedOffset(evt.getPartition(), evt.getOffset())); }
Extra scenario prompts (practice)
- Compensation arrives out of order — enforce sequence numbers per aggregate and detect gaps.
- Non-compensatable step (email sent) — use forward-only Saga; send correction email if needed.
- Hot partition causes retry storms — add circuit breakers, rate limits, and per-key backoff.
- Replay overwhelms downstream — throttle replay, separate catch-up queues.
End-to-end Java example (mini blueprint)
1) Outbox entity and save helper
@Entity @Table(name = "outbox") public class OutboxEvent { @Id UUID eventId; String type; String aggregateId; @Lob String payload; Instant createdAt = Instant.now(); } @Transactional public void capturePaymentAndOutbox(UUID orderId, BigDecimal amount) { paymentRepo.markCaptured(orderId, amount); OutboxEvent evt = new OutboxEvent(); evt.eventId = UUID.randomUUID(); evt.type = "PaymentCaptured"; evt.aggregateId = orderId.toString(); evt.payload = toJson(Map.of("orderId", orderId.toString())); outboxRepo.save(evt); }
2) CDC publishes to Kafka (Debezium config not shown).
3) Inventory consumer (idempotent)
@KafkaListener(topics = "payment-events", groupId = "inventory") public void onPaymentCaptured(String json) { PaymentCapturedEvent evt = parse(json); if (processedRepo.existsByEventId(evt.getEventId())) return; inventoryService.reserve(evt.getOrderId(), evt.getItems()); processedRepo.save(new ProcessedEvent(evt.getEventId())); }
4) Orchestrator state transitions stored in DB so crash recovery is safe.
If you remember three things: prefer Sagas over global locks, make every step idempotent, and keep an audit trail (outbox + logs). That’s how you build reliable distributed systems.
Related Articles
Incident Playbook for Beginners: Real-World Monitoring and Troubleshooting Stories
A story-driven, plain English incident playbook for new backend & SRE engineers. Find, fix, and prevent outages with empathy and practical steps.
System Design Power-Guide 2025: What To Learn, In What Order, With Real-World Links
Stop bookmarking random threads. This is a tight, no-fluff map of what to study for system design in 2025 - what each topic is, why it matters in interviews and production, and where to go deeper.
DSA Patterns Master Guide: How To Identify Problems, Pick Patterns, and Practice (With LeetCode Sets)
A practical, pattern-first road map for entry-level engineers. Learn how to identify the right pattern quickly, apply a small algorithm template, know variants and pitfalls, and practice with curated LeetCode problems.