
Durable Execution vs. the Saga Pattern for Payment Workflows

Saga compensations are necessary but insufficient. Durable execution provides exactly-once guarantees that financial processes require.

Durable Execution vs. Saga Pattern: Choosing the Right Workflow Model for Payments

Step 3 of 5 fails. The queue consumer crashes. On restart, does it replay from step 1? Skip to step 4? The answer depends on state that the queue doesn't track.

This is the central problem of multi-step financial processes. Account opening requires four coordinated steps: create entity, provision ledger accounts, assign IBAN, trigger KYC verification. Payment execution requires five: validate, screen, debit, submit to clearing, track settlement. Each step depends on the previous one. Each step has side effects that cannot be casually repeated.

Two architectural patterns address this problem. They solve it differently, with different trade-offs, and the choice matters more than most architecture decisions in a payment system.

The Saga Pattern

Sagas coordinate independent services through compensating transactions. Two variants exist.

Choreography: each service publishes events. The next service subscribes and acts. When something fails, the failed service publishes a compensation event, and upstream services listen and undo their work.

The appeal is decoupling. No central coordinator. Each service owns its logic. Add a new step by adding a new subscriber.

The problem is visibility. The "process" exists nowhere as a single artifact. It is an emergent property of event flows across services. Debugging a failed payment requires correlating logs across every consumer that participated. If an event is lost (and events do get lost: network partitions, consumer crashes before acknowledgment, dead-letter overflow), the failure is silent. No component knows the process is incomplete, because no component tracks the process.
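A minimal sketch of that silent failure, assuming a toy in-memory event bus. `EventBus`, its `drop` parameter, and the event names are illustrative and not any real broker's API. Each service behaves correctly in isolation, yet when one event is lost the payment simply stops and nothing records that it is incomplete.

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory bus; `drop` simulates events lost in transit."""
    def __init__(self, drop=()):
        self.handlers = defaultdict(list)
        self.drop = set(drop)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        if event_type in self.drop:
            return  # lost: no error is raised and no subscriber is notified
        for handler in self.handlers[event_type]:
            handler(payload)

def run_payment(bus):
    """Wire three services together purely through events; return what each did."""
    done = []

    def validate(payload):
        done.append("validated")
        bus.publish("PaymentValidated", payload)

    def screen(payload):
        done.append("screened")
        bus.publish("PaymentScreened", payload)

    def debit(payload):
        done.append("debited")

    bus.subscribe("PaymentRequested", validate)
    bus.subscribe("PaymentValidated", screen)
    bus.subscribe("PaymentScreened", debit)
    bus.publish("PaymentRequested", {"payment_id": "p-1"})
    return done

complete = run_payment(EventBus())
stalled = run_payment(EventBus(drop={"PaymentScreened"}))
print(complete)  # ['validated', 'screened', 'debited']
print(stalled)   # ['validated', 'screened']
```

In the stalled run no exception surfaces anywhere. The only evidence of the problem is the absence of a later event.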

Orchestration: a central coordinator sends commands to each service in sequence. The coordinator knows the current step. If a step fails, the coordinator executes compensation in reverse order.

This is better. The process is visible in one place. An operator can inspect the coordinator's state and see: step 3 of 5 is pending. The coordinator is the single source of truth for the process.

But: the coordinator itself must survive failures. If the coordinator crashes between step 3 (debit the account) and step 4 (submit to clearing), what happens? The answer depends on how the coordinator persists its own state. If it uses an in-memory state machine, the state is lost. If it persists to a database, recovery depends on whether the database write committed before the crash. The coordinator's own reliability becomes a design problem, one that Saga orchestration does not inherently solve.

Both Saga variants share a fundamental property: each step is an independent transaction. The saga coordinates them. But the coordination itself is not transactional. The gap between "step 3 committed" and "the coordinator recorded that step 3 committed" is where data loss hides.
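That gap can be made concrete with a small sketch. It assumes a hypothetical `Coordinator` that persists only the index of the last recorded step, a `Ledger` stand-in that is not idempotent, and a `crash_after_step` parameter that simulates a crash after the step's own transaction commits but before the coordinator records progress.

```python
class Ledger:
    """Stand-in for a ledger service that is not idempotent."""
    def __init__(self):
        self.debits = []

    def debit(self, payment_id, amount):
        self.debits.append((payment_id, amount))

class CoordinatorCrash(Exception):
    pass

class Coordinator:
    """Saga orchestrator that persists only the index of the last recorded step."""
    def __init__(self, ledger, store):
        self.ledger = ledger
        self.store = store  # stand-in for the coordinator's database row

    def run(self, payment_id, amount, crash_after_step=None):
        steps = [
            lambda: None,                                   # 1: validate
            lambda: None,                                   # 2: screen
            lambda: self.ledger.debit(payment_id, amount),  # 3: debit
            lambda: None,                                   # 4: submit to clearing
        ]
        start = self.store.get(payment_id, 0)
        for index in range(start, len(steps)):
            steps[index]()                  # the step's own transaction commits
            if crash_after_step == index + 1:
                raise CoordinatorCrash()    # crash before progress is recorded
            self.store[payment_id] = index + 1

ledger, store = Ledger(), {}
try:
    Coordinator(ledger, store).run("p-1", 100, crash_after_step=3)
except CoordinatorCrash:
    pass
recorded_step = store["p-1"]                # coordinator believes only step 2 finished
Coordinator(ledger, store).run("p-1", 100)  # restart resumes at step 3
print(recorded_step, ledger.debits)  # 2 [('p-1', 100), ('p-1', 100)]
```

The restarted coordinator behaves exactly as designed and still debits the account twice, because its persisted state lagged one step behind reality.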

Durable Execution

Durable execution takes a different approach. The workflow engine journals every step BEFORE executing it. The journal is the process state. If the engine crashes and restarts, it replays the journal and resumes at the exact point of interruption.

The mental model is different from Sagas. In a Saga, you define the forward path and the compensation path as separate concerns. In durable execution, you write the workflow as a linear function (step 1, step 2, step 3), and the engine guarantees that the function runs to completion. If a step fails permanently (not a transient error but a genuine business failure such as "account does not exist"), the engine executes compensation as part of the same journaled workflow.
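The journal-and-replay idea can be sketched in a few lines. This is a toy under stated assumptions, not the API of Temporal, Restate, or any other engine: `Context`, `ctx.step`, and `SimulatedCrash` are hypothetical names, a Python list stands in for the durable log, and the toy records only step results.

```python
class SimulatedCrash(Exception):
    pass

class Context:
    """Toy durable-execution context; `journal` stands in for a durable log."""
    def __init__(self, journal):
        self.journal = journal
        self.position = 0

    def step(self, name, fn):
        if self.position < len(self.journal):
            recorded_name, result = self.journal[self.position]  # replay
            assert recorded_name == name, "workflow code must be deterministic"
        else:
            result = fn()                         # first execution
            self.journal.append((name, result))   # record before moving on
        self.position += 1
        return result

executed = []  # side effects that actually ran, across all attempts

def action(name, crash=False):
    def run():
        if crash:
            raise SimulatedCrash(name)
        executed.append(name)
        return f"{name}-ok"
    return run

def payment_workflow(ctx, crash_at=None):
    # The workflow reads as a plain linear function.
    for name in ["validate", "screen", "debit", "submit", "track"]:
        ctx.step(name, action(name, crash=(name == crash_at)))
    return "COMPLETED"

journal = []
try:
    payment_workflow(Context(journal), crash_at="submit")
except SimulatedCrash:
    pass
status = payment_workflow(Context(journal))  # restart: replay, then resume
print(status, executed)
```

On the second run the first three steps are answered from the journal, so `executed` contains each step name exactly once even though the workflow function ran twice.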

Three properties distinguish durable execution from Saga orchestration:

The journal survives process restarts. The engine writes each step to a durable log before execution. On restart, it replays the log and resumes. The journal is not an optimization or an afterthought; it is the execution mechanism. The process does not run "on top of" the journal. The process IS the journal, replayed.

Idempotency is handled by the engine. In a Saga, every service must implement idempotency independently. If the coordinator retries step 3, the service must detect the duplicate and return the previous result. In durable execution, the engine tracks which steps have completed. On replay, completed steps return their journaled results without re-executing, so the service is not called again for work the journal already holds. The one exception is a step that was in flight when the engine crashed: its result was never journaled, so it is retried, and a service with external side effects should still deduplicate on a key derived from the workflow ID and step.

The journal is the audit trail. Every workflow has a correlation ID. Every step is recorded with inputs, outputs, duration, and outcome. An auditor can reconstruct the complete lifecycle of a payment from a single identifier. This is not a logging feature but a structural property of the execution model. The audit trail exists because the journal exists, and the journal exists because the engine cannot function without it.
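As an illustration of the third property, here is one possible shape for a journal entry and the lookup an auditor would run. The `JournalEntry` fields mirror the attributes listed above; the class name, the field names, and the sample values are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JournalEntry:
    workflow_id: str     # the correlation ID
    step: str
    inputs: dict
    outputs: dict
    duration_ms: float
    outcome: str         # e.g. "OK" or "REJECTED"

def audit_trail(journal, workflow_id):
    """Return one line per recorded step of a single workflow, in journal order."""
    return [
        f"{entry.step}: {entry.outcome} ({entry.duration_ms}ms)"
        for entry in journal
        if entry.workflow_id == workflow_id
    ]

# Two interleaved workflows sharing one journal (illustrative data).
journal = [
    JournalEntry("wf-1", "validate", {"amount": 100}, {"valid": True}, 12, "OK"),
    JournalEntry("wf-2", "validate", {"amount": 250}, {"valid": True}, 9, "OK"),
    JournalEntry("wf-1", "debit", {"amount": 100}, {"transfer_id": "t-1"}, 0.4, "OK"),
]
trail = audit_trail(journal, "wf-1")
print(trail)  # ['validate: OK (12ms)', 'debit: OK (0.4ms)']
```

The lookup needs nothing beyond the correlation ID because the entries it filters are the same records the engine uses to resume execution.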

A Concrete Example: SEPA Credit Transfer

Walk through a payment with both patterns.

The happy path (both patterns handle this identically):

1. Validate IBAN and amount ✓
2. AML screening (external provider) ✓
3. Debit sender account (ledger) ✓
4. Submit to clearing (SEPA SCT via CSM) ✓
5. Track settlement status ✓

The failure: step 4 is rejected by the clearing network.

The sender's account has been debited (step 3). The clearing network rejected the submission (step 4). The debit must be reversed.

Saga approach:

The clearing adapter publishes a PaymentRejected event. A compensation consumer picks it up, calls the ledger to reverse the debit, and publishes DebitReversed. The original payment record is updated to FAILED.

But: what if the compensation consumer crashes before reversing the debit? The PaymentRejected event sits in the dead-letter queue. The sender's account remains debited. No component is actively tracking this state. Discovery depends on either a reconciliation run (hours or days later) or a customer complaint.

Mitigations exist: persistent queues with at-least-once delivery, consumer health monitoring, dead-letter alerting. Each one is an additional system to build and maintain. The compensation path becomes as complex as the forward path.
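A minimal sketch of the stuck-compensation case, assuming a toy at-least-once delivery loop. `deliver`, `max_attempts`, and `ConsumerCrash` are hypothetical and do not correspond to any particular broker's API.

```python
class ConsumerCrash(Exception):
    """Raised when the hypothetical compensation consumer dies mid-handling."""

def deliver(event, handler, dead_letters, max_attempts=3):
    """Toy at-least-once delivery: retry the handler, then dead-letter the event."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except ConsumerCrash:
            continue  # redeliver
    dead_letters.append(event)  # parked; nothing acts on it unless someone alerts
    return False

ledger = {"sender": 400}  # step 3 already debited 100 from a balance of 500
dead_letters = []

def reverse_debit(event):
    raise ConsumerCrash()  # crashes before the reversal reaches the ledger

delivered = deliver(
    {"type": "PaymentRejected", "amount": 100}, reverse_debit, dead_letters
)
print(delivered, ledger["sender"], len(dead_letters))  # False 400 1
```

The event is not lost, which is the point of the persistent queue, but the sender stays debited until something outside the payment flow notices the dead letter.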

Durable execution approach:

Step 4 fails. The engine records the failure in the journal. It executes the compensation handler (reverse the debit) as the next journaled step. If the engine crashes during the reversal, it replays the journal on restart and resumes there: a reversal already recorded as complete is not re-executed, and one that was still in flight is retried under the same workflow and step identity, so the ledger can recognize a duplicate.

The complete sequence (forward path, failure, compensation) is in one journal with one correlation ID. An operator inspects the workflow and sees:

Workflow: SEPA-CT-2026-02-21-00847
  Step 1: Validate         → OK (12ms)
  Step 2: AML Screen       → OK (340ms, provider: screening-svc)
  Step 3: Debit            → OK (0.4ms, transfer_id: 9a3f...)
  Step 4: Submit Clearing  → REJECTED (reason: AC03, invalid creditor account)
  Step 5: Compensate Debit → OK (0.3ms, reversal_id: b7c1...)
  Status: COMPENSATED

No forensic log analysis. No dead-letter queue to monitor. No reconciliation required to discover the failure. The process is self-documenting.
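A sketch of how compensation ends up in the same journal as the forward path, producing a trace shaped like the operator view above. It is a simplified illustration: `run_sepa_transfer` and `ClearingRejected` are hypothetical names, the ledger is a plain dict, amounts are fixed at 100, and crash recovery is left out to keep the focus on compensation.

```python
class ClearingRejected(Exception):
    """Permanent business failure reported by the clearing network."""

def run_sepa_transfer(journal, ledger, clearing_accepts):
    """Run the forward path; on a clearing rejection, journal the compensation."""
    def record(step, outcome):
        journal.append((step, outcome))

    record("Validate", "OK")
    record("AML Screen", "OK")
    ledger["sender"] -= 100
    record("Debit", "OK")
    try:
        if not clearing_accepts:
            raise ClearingRejected("AC03")
        record("Submit Clearing", "OK")
    except ClearingRejected as rejection:
        record("Submit Clearing", f"REJECTED ({rejection})")
        ledger["sender"] += 100          # compensation: reverse the debit
        record("Compensate Debit", "OK")
        return "COMPENSATED"
    record("Track Settlement", "OK")
    return "COMPLETED"

journal, ledger = [], {"sender": 500}
status = run_sepa_transfer(journal, ledger, clearing_accepts=False)
for number, (step, outcome) in enumerate(journal, start=1):
    print(f"Step {number}: {step} -> {outcome}")
print("Status:", status)
```

The rejection and the reversal are ordinary entries in the same list as the forward steps, which is why a single identifier is enough to see the whole story.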

Where Each Pattern Fits

| Aspect | Saga (Choreography) | Saga (Orchestration) | Durable Execution |
| --- | --- | --- | --- |
| Process visibility | Distributed across event logs | Central coordinator | Single journal with full replay |
| Recovery model | Compensation events (may be lost) | Coordinator resumes (if state persisted) | Journal replay (guaranteed) |
| Audit trail | Requires log aggregation across services | Coordinator log (single point) | Built-in, journal IS the trail |
| Idempotency | Each service must implement | Each service must implement | Engine-guaranteed |
| Partial failure | Silent if compensation event lost | Uncertain if coordinator crashes mid-step | Journal survives, replay resumes |
| Complexity | Low (per service), high (system-wide) | Medium (coordinator), medium (services) | Low (workflow code), low (infrastructure) |
| Best for | Loosely coupled, low-stakes coordination | Multi-service orchestration | Financial processes requiring exactly-once |

Sagas are not wrong. They work well for coordination that can tolerate eventual consistency: inventory reservations, notification workflows, analytics pipelines. The choreography variant excels when services are truly independent and the "process" is a convenience, not a requirement.

For financial processes, where every step has irreversible side effects, where compensation must be provable, where an auditor will ask "show me every step of this payment", durable execution provides stronger guarantees with less operational complexity.

The Industry Has Converged

The evidence is practical, not theoretical. The payment industry has made its choice:

Stripe uses Temporal (durable execution) for payment orchestration.
Revolut uses Temporal for critical payment flows.
Wise migrated from Saga orchestration toward Temporal.
N26 uses Kafka + Sagas, and publicly discusses the operational complexity of compensation management.

The pattern is clear. For payment-critical paths, the industry is converging on durable execution. The journal-based model provides the traceability that regulators require (DORA Art. 11-12) and the operational visibility that engineering teams need.

Sources:

Garcia-Molina, Hector and Salem, Kenneth. "Sagas." SIGMOD '87 (the original Saga pattern paper)
DORA, Regulation (EU) 2022/2554, Art. 11-12 (ICT traceability and recovery)
Restate documentation: "Microservice Orchestration" (https://docs.restate.dev/tour/microservice-orchestration)
Temporal documentation: "What is durable execution?" (https://docs.temporal.io/concepts)