Your Database Writes Twice: How the Write-Ahead Log Actually Works

When you run an INSERT in Postgres, what do you think happens? Most developers picture a straightforward flow: the database receives your SQL, finds the right spot on disk, writes the row, and confirms the commit. One write, one confirmation.

That's not what happens. The database writes twice. And the reason it does this is one of the most elegant trade-offs in all of systems engineering. Understanding it changes how you think about durability, crash recovery, replication, and eventually patterns like CDC with tools like Debezium.

The real flow of a write operation

Your application sends an INSERT through a connection (or a connection pool). Postgres receives it, parses the SQL, generates a query plan, and starts executing. So far, nothing surprising.

Here's where it gets interesting. Postgres modifies the data pages in memory, inside the shared buffer pool. These modified pages are called dirty pages. The actual data files on disk haven't been touched yet. Your row only exists in RAM at this point.

Now, before Postgres can tell your application "commit done, you're good", it has to guarantee that the change won't be lost if the server crashes in the next millisecond. So it writes the changes to the WAL (Write-Ahead Log) on disk, with an fsync to make sure the kernel actually flushes the data to the physical device. Only after the WAL write is confirmed does Postgres send the commit acknowledgment back to your app.
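This wait-for-the-fsync-before-acknowledging behavior is actually configurable. As a sketch against a stock Postgres (where 'on' is the default), the synchronous_commit setting controls whether COMMIT waits for the WAL flush:

```sql
-- Does COMMIT wait for the WAL fsync? Default is 'on'.
SHOW synchronous_commit;

-- Setting it to 'off' returns COMMIT OK before the WAL flush:
-- faster, but a crash can lose the last few hundred milliseconds
-- of acknowledged transactions (no corruption, just lost commits).
SET synchronous_commit = off;
```

Turning it off trades the durability guarantee described above for latency, which is why it's off the table for most transactional workloads.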

The dirty pages sitting in the shared buffer pool? They get flushed to the actual data files later, in the background, by a process called the checkpointer.

┌─────────────────────────────────────────────────────────────────────┐
│                        YOUR APPLICATION                             │
│                   INSERT INTO orders VALUES (...)                   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        POSTGRES ENGINE                              │
│                                                                     │
│  1. Parse SQL → generate execution plan                             │
│  2. Modify pages in Shared Buffer Pool (RAM) → dirty pages          │
│  3. Write to WAL on disk (sequential append + fsync) ← DURABILITY   │
│  4. Return COMMIT OK to client                                      │
│                                                                     │
│  ... later, in background ...                                       │
│                                                                     │
│  5. Checkpointer flushes dirty pages → data files on disk           │
└─────────────────────────────────────────────────────────────────────┘

Two things to understand about why this design exists.

First, the WAL doesn't store the SQL you wrote. It doesn't record "INSERT INTO orders VALUES (5, 100.50)". It records physical changes: "in page 4523, at offset 12, write these bytes". It operates at the page level, not the SQL level. This distinction matters a lot when we get to replication.

Second, the reason for writing twice is pure physics. Writing sequentially to the end of a file (append-only) is drastically faster than doing random I/O to update pages scattered across the disk. The WAL gives you durability cheaply with fast sequential writes. The expensive random writes to the actual data files happen later, in the background, batched together by the checkpointer. You get the safety of immediate persistence with the performance of deferred writes.

Crash recovery: the whole point

The entire design of the WAL exists for one scenario: what happens when the server crashes between the WAL write and the checkpoint.

Normal operation:
  WAL write (fsync) ──→ COMMIT OK ──→ ... time passes ... ──→ Checkpoint flushes to disk

Crash scenario:
  WAL write (fsync) ──→ COMMIT OK ──→ 💥 CRASH (dirty pages lost from RAM)

Recovery:
  Restart ──→ Read WAL from last checkpoint ──→ Replay unflushed operations ──→ Consistent state

When Postgres starts up after a crash, it doesn't panic. It knows exactly what to do: read the WAL from the last known checkpoint and replay any operations that were committed but never made it to the data files. After the replay, the database is in a fully consistent state. No data lost, no corruption.

The LSN and the Checkpointer

Every WAL entry gets a Log Sequence Number (LSN), which is a monotonically increasing offset into the WAL stream. Think of it as a position marker. LSN 5000 came before LSN 6000, and both came before LSN 7000.

The checkpointer, when it flushes dirty pages to disk, records the LSN up to which everything has been persisted. This is the checkpoint LSN. If Postgres crashes and restarts, it says "my last checkpoint was at LSN 5000, I'll replay the WAL from there". Everything before LSN 5000 is already safely on disk in the data files. Only the operations after that need to be replayed.
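You can watch these positions on a live cluster. A sketch (function names assume PostgreSQL 10 or newer; they were renamed from the older pg_xlog variants):

```sql
-- Current end of the WAL stream
SELECT pg_current_wal_lsn();

-- Where the last checkpoint started, straight from pg_control
SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();

-- LSNs are byte offsets, so you can measure the distance between them
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn) AS bytes_since_checkpoint
FROM pg_control_checkpoint();
```

The redo_lsn is exactly the position crash recovery would start replaying from.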

The WAL itself is divided into segment files, 16MB each by default. Once a segment only contains operations older than the checkpoint LSN, it can be recycled. The WAL doesn't grow infinitely. Old segments get reused, new segments get created. It's essentially a ring buffer of files.
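The segment size and the checkpoint cadence are both visible as settings; a sketch:

```sql
SHOW wal_segment_size;    -- 16MB by default (fixed at initdb time)
SHOW checkpoint_timeout;  -- latest a checkpoint will run (default 5min)
SHOW max_wal_size;        -- WAL volume that forces an earlier checkpoint (default 1GB)
```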

If this sounds familiar, it should. It's the same concept as consumer offsets in Kafka. A Kafka consumer tracks an offset that says "I've processed up to here". The checkpointer tracks an LSN that says "I've persisted up to here". If something fails, you resume from the last confirmed position. An append-only log plus a cursor is the fundamental abstraction behind half of distributed systems.

Replication slots: how other systems hook into the WAL

Now that you understand the WAL, the next question is natural: can other systems read it? Yes. That's exactly what replication slots are for.

A replication slot is a persistent object inside Postgres that survives restarts. It tracks a consumer of the WAL stream. You can see all active slots with a simple query.

SELECT slot_name, slot_type, plugin, confirmed_flush_lsn
FROM pg_replication_slots;

Each slot records a name, a type (physical or logical), a decoding plugin (for logical slots), and a confirmed_flush_lsn which tells Postgres "this consumer has read up to here".

One important detail: the connection that a replica or Debezium uses to consume the WAL is not a regular SQL connection. Postgres speaks two distinct protocols. The standard one handles your normal queries. The streaming replication protocol is a separate thing entirely, activated by the replication parameter in the connection string: replication=true for physical streaming, replication=database for logical.

You can have multiple replication slots active at the same time. One for a read replica, one for Debezium, one for another service. Each slot maintains its own LSN independently. Postgres retains WAL segments until the slowest consumer has confirmed reading them.

⚠️ Watch your slots

An abandoned replication slot that nobody is consuming will prevent Postgres from recycling WAL segments. The WAL will keep growing until it fills your disk. This is the number one thing to monitor when you set up replication or CDC. If you decommission a consumer, drop the slot.
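A sketch of how you might monitor this (debezium_slot is just an example name). Compute how far each slot's confirmed position lags behind the current end of the WAL, and drop slots you no longer need. Note that confirmed_flush_lsn is only populated for logical slots; physical slots expose restart_lsn instead:

```sql
-- How many bytes of WAL is each slot holding back?
SELECT slot_name, slot_type, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(),
                       coalesce(confirmed_flush_lsn, restart_lsn)) AS retained_bytes
FROM pg_replication_slots;

-- Decommissioned a consumer? Drop its slot so the WAL can be recycled.
SELECT pg_drop_replication_slot('debezium_slot');
```

A retained_bytes value that only ever grows, on a slot where active is false, is the disk-filling scenario described above.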

Physical vs logical replication

Here's where the WAL's physical nature creates an interesting problem. If the WAL stores changes as "write these bytes at this page offset", how can you replicate to a system that has a completely different disk layout? Or to a different database entirely?

The answer is that Postgres supports two kinds of replication, and you choose which one explicitly when you create the slot.

Physical replication sends the raw WAL bytes to the replica. This works because the replica is a byte-for-byte copy of the primary. Same Postgres version, same architecture, same disk layout. The replica applies the same page-level changes to its own files. It's fast and simple, but completely inflexible. You can't replicate to a different system or even a different Postgres major version.

SELECT pg_create_physical_replication_slot('replica1');

Logical replication takes a different approach. Postgres passes the physical WAL through a decoding plugin (pgoutput is the built-in one used by native logical replication) that translates the page-level changes back into semantic operations: "INSERT into table orders, column id=5, column total=100.50". Now the changes have meaning that any system can understand, regardless of its internal storage format. This is what makes it possible for Debezium, or another Postgres instance with a different schema, or a completely different database to consume the changes.

SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

The physical WAL is always generated. Every write goes through the same path we discussed earlier. Logical decoding is an additional step that only happens when a logical slot actively consumes the stream. Postgres doesn't guess which type of replication you want. You declare it when you create the slot.
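You can see logical decoding in action from plain SQL, without any external consumer. A sketch using test_decoding, a demo plugin that ships with Postgres and emits human-readable text (pgoutput speaks a binary protocol meant for real subscribers); it assumes the orders table from earlier exists:

```sql
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

INSERT INTO orders VALUES (5, 100.50);

-- Peek at the decoded changes without consuming them
SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

SELECT pg_drop_replication_slot('demo_slot');
```

The data column shows exactly the kind of semantic translation described above: table and column names with values, reconstructed from page-level WAL records.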

Why this matters beyond durability

Understanding the WAL gives you a mental model for a whole family of patterns. Durability and crash recovery are just the starting point. Replication slots let you build read replicas. Logical decoding opens the door to Change Data Capture, where tools like Debezium tail the WAL and publish every database change as an event to Kafka or another message broker. And CDC is the foundation for patterns like the transactional outbox, where you write an event to a table inside the same transaction as your business data, and Debezium picks it up from the WAL and publishes it downstream.

It all starts with one simple design decision: write to an append-only log before touching the data files. Everything else flows from that.

TL;DR
  • Postgres writes changes to the WAL (sequential, append-only) before confirming COMMIT. Dirty pages in RAM get flushed to disk later by the checkpointer.
  • The WAL stores physical page-level changes (bytes + offsets), not SQL statements.
  • Sequential writes (WAL) are fast. Random writes (data files) are slow. This split gives you cheap durability with deferred expensive I/O.
  • Each WAL entry has an LSN (Log Sequence Number). The checkpointer tracks the last persisted LSN. Crash recovery replays the WAL from that point.
  • Replication slots are persistent consumers of the WAL stream. Each tracks its own confirmed LSN.
  • Physical replication sends raw WAL bytes (fast, but replica must be identical). Logical replication decodes WAL into semantic changes (flexible, works with external systems like Debezium).
  • Abandoned replication slots prevent WAL recycling and will fill your disk. Always monitor and drop unused slots.