Real-Time Data Streaming: Change Data Capture (CDC) in PostgreSQL
In modern, event-driven microservice architectures, data needs to flow seamlessly between systems. When a user updates their profile in the main database, the search index (Elasticsearch) needs to update, the cache (Redis) needs to invalidate, and the analytics data warehouse needs to log the event.
Historically, engineers solved this with Dual Writes: writing to the database and publishing to a message broker (like Kafka) in the same application function.
This is a massive anti-pattern. If the database write succeeds but the Kafka publish fails, your systems are immediately out of sync.
The industry solution to this problem is Change Data Capture (CDC). CDC treats the database itself as the ultimate source of truth, reading changes directly from the database’s internal logs and streaming them downstream.
1. How PostgreSQL CDC Works: Logical Decoding
PostgreSQL achieves CDC through a mechanism called Logical Decoding.
Every relational database has a transaction log. In Postgres, this is called the WAL (Write-Ahead Log). Before Postgres actually modifies the data files on the hard drive, it appends the change to the WAL. This ensures that if the server crashes during a write, it can replay the WAL on reboot to recover the data.
Logical Decoding allows external systems to read this WAL. The process relies on three Postgres concepts:
wal_level = logical: You must configure Postgres to write extra metadata into the WAL so that external systems can understand the logical row changes (rather than just raw disk block changes).- Replication Slots: A marker that tells Postgres, “Do not delete this section of the WAL, an external consumer still needs to read it.”
- Publications: A filter that specifies exactly which tables’ changes should be broadcasted.
2. Setting Up Postgres for CDC
Let’s walk through configuring a Postgres database to emit CDC events.
Step 1: Configure postgresql.conf
You must enable logical replication. This requires a restart of the PostgreSQL server.
# postgresql.conf
wal_level = logical
max_replication_slots = 10 # Max number of external consumers
max_wal_senders = 10 # Max concurrent WAL streaming connections
Step 2: Create a Publication
We don’t necessarily want to stream every table. Let’s assume we have an e-commerce database and we only want to stream changes from the orders and users tables.
-- Run this as a superuser in your database
CREATE PUBLICATION my_cdc_publication FOR TABLE orders, users;
Step 3: Set up a Logical Replication Slot
External tools usually do this automatically, but you can see how it works manually using the built-in pgoutput plugin.
SELECT * FROM pg_create_logical_replication_slot('my_cdc_slot', 'pgoutput');
3. Consuming CDC Streams with Debezium
While you could write a custom script to connect to the replication slot and read the raw binary stream, nobody does this in production. The industry standard is Debezium, an open-source distributed platform for CDC built on top of Apache Kafka.
Debezium acts as the connector. It connects to the Postgres replication slot, decodes the WAL into friendly JSON, and pushes those JSON events into a Kafka topic.
Configuring the Debezium Postgres Connector
You deploy Debezium via Kafka Connect and provide it a configuration payload specifying your database details and the publication you created.
{
"name": "postgres-inventory-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres.internal",
"database.port": "5432",
"database.user": "cdc_user",
"database.password": "supersecret",
"database.dbname": "ecommerce_db",
"database.server.name": "dbserver1",
// The publication and slot we set up
"publication.name": "my_cdc_publication",
"slot.name": "my_cdc_slot",
"plugin.name": "pgoutput"
}
}
The CDC Payload
Once running, if an application runs UPDATE users SET status = 'active' WHERE id = 42;, Debezium instantly captures this from the WAL and pushes a payload to Kafka that looks like this:
{
"before": {
"id": 42,
"email": "user@example.com",
"status": "pending"
},
"after": {
"id": 42,
"email": "user@example.com",
"status": "active"
},
"source": {
"version": "2.4.0.Final",
"connector": "postgresql",
"db": "ecommerce_db",
"schema": "public",
"table": "users",
"txId": 15023
},
"op": "u", // 'c' for create, 'u' for update, 'd' for delete
"ts_ms": 1696081423451
}
This payload is incredibly powerful. Because it contains both the before and after state, downstream microservices have total context of the change without ever needing to query the database.
4. Use Cases and Architecture
By piping these CDC events into Kafka, you unlock highly decoupled, real-time architectures:
- Cache Invalidation: A Redis worker listens to the Kafka topic and immediately invalidates or updates the cached user profile when a database update occurs.
- Search Indexing: An Elasticsearch synchronizer consumes the stream to ensure the search index is never more than a few milliseconds behind the main database.
- Data Warehousing / Analytics: Stream changes directly into Snowflake or BigQuery for real-time dashboards without running expensive batch ETL jobs.
- Outbox Pattern: Store events in a database table as part of a transaction, and use CDC to reliably forward those events to other microservices.
5. The Critical Danger: Disk Space Exhaustion
CDC is powerful, but it introduces a severe operational risk regarding Replication Slots.
Remember that a Replication Slot tells Postgres to keep the WAL files on disk until the consumer (Debezium) acknowledges it has processed them.
What happens if Debezium crashes or the network disconnects for 2 days?
Postgres will obediently hold onto every single WAL file generated during those 2 days. The pg_wal directory will grow exponentially until the hard drive reaches 100% capacity. When the hard drive is full, Postgres will crash and refuse to start.
Mitigation
To prevent an offline consumer from bringing down your entire database, always configure the max_slot_wal_keep_size parameter (available in Postgres 13+).
# If a slot falls behind by 50GB, Postgres drops the slot and deletes the WAL
# Better to lose CDC data than crash the production database.
max_slot_wal_keep_size = 50GB
Conclusion
Change Data Capture transforms your database from a passive storage container into the active, beating heart of an event-driven architecture. By leveraging PostgreSQL’s logical decoding alongside tools like Debezium, you guarantee data consistency across your entire infrastructure without the race conditions and dual-write headaches of the past.
Ready to Level Up Your SQL?
Don't just read about databases. Write queries, build schemas, and practice real-world interview questions interactively.
Start Learning for Free