Change Data Capture (CDC) is a technique used to track and capture changes in a database so that data stored in other systems remains synchronized with the source database.
This ensures data consistency across distributed systems while minimizing the need for full database snapshots.
Ensures historical tracking of data changes, which is crucial for auditing and compliance.
Helps microservices communicate efficiently when database changes occur.
Reduces processing overhead by moving only incremental changes instead of repeatedly exporting full snapshots.
CDC can be implemented in different ways based on how data updates are detected and transmitted.
Push-based CDC: The source system pushes changes to the target system.
Enables real-time updates but risks data loss if the target system is unreachable.
Example: A purchase order microservice pushes updates to a shipping service.
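As a rough illustration of the push model, the sketch below forwards each committed order change to the shipping service and falls back to a local retry buffer so that an unreachable target does not silently drop events. The endpoint URL and event payload are assumptions made for the example, not details from the text.

```python
import queue

import requests  # third-party HTTP client (pip install requests)

# Hypothetical endpoint of the shipping service; invented for illustration.
SHIPPING_SERVICE_URL = "http://shipping-service.internal/api/order-updates"

retry_buffer = queue.Queue()  # holds events the target could not accept


def push_order_change(event: dict) -> None:
    """Push a single change event to the target; buffer it locally on failure."""
    try:
        response = requests.post(SHIPPING_SERVICE_URL, json=event, timeout=5)
        response.raise_for_status()
    except requests.RequestException:
        # Without this buffer (or a message broker in between), the event would be lost.
        retry_buffer.put(event)


push_order_change({"order_id": 42, "status": "PAID", "changed_at": "2024-01-01T12:00:00Z"})
```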
Pull-based CDC: The target system polls the source database at intervals to fetch updates, which introduces batch-like latency.
Example: A reporting tool queries a sales database every 10 minutes for new orders.
There are several ways to implement CDC:
Query-based CDC: Queries the source database for changes using a timestamp column (e.g., updated_at).
Pros: Simple to implement.
Cons: Adds computational overhead as every row needs to be scanned.
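A minimal sketch of the query-based approach, using SQLite and an updated_at column purely for illustration; the watermark (the latest timestamp already processed) is kept in memory here, though a real pipeline would persist it between runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.execute(
    "INSERT INTO orders VALUES "
    "(1, 'NEW', '2024-01-01T10:00:00'), (2, 'PAID', '2024-01-01T11:00:00')"
)

last_seen = "2024-01-01T10:30:00"  # watermark: latest updated_at processed so far


def fetch_changes(connection, watermark):
    """Return rows modified after the watermark; every poll scans by timestamp."""
    rows = connection.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


changes, last_seen = fetch_changes(conn, last_seen)
print(changes)  # [(2, 'PAID', '2024-01-01T11:00:00')]
```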
Log-based CDC: Reads changes directly from the database transaction log.
Pros: Low overhead, real-time event tracking.
Cons: Requires database support (e.g., MySQL binlog, PostgreSQL WAL).
Example: Debezium streams database changes to Apache Kafka.
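To show what the consuming side of such a pipeline can look like, here is a hedged sketch using the kafka-python client. The broker address, topic name, and table are assumptions; the op/before/after fields follow Debezium's standard change-event envelope, assuming the connector is configured to emit plain JSON without the schema wrapper.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Topic naming follows Debezium's <topic-prefix>.<schema>.<table> convention;
# the concrete names here are placeholders for illustration.
consumer = KafkaConsumer(
    "inventory.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    op = event.get("op")          # 'c' = create, 'u' = update, 'd' = delete
    before = event.get("before")  # row image before the change (None for inserts)
    after = event.get("after")    # row image after the change (None for deletes)
    print(op, before, after)
```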
Trigger-based CDC: Uses database triggers to detect changes and record them for the CDC system.
Pros: Captures changes in real time, even on databases whose transaction logs are not accessible.
Cons: Triggers can slow down writes if overused.
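A self-contained illustration of the trigger approach using SQLite: an AFTER UPDATE trigger copies the old and new values into a changelog table that a downstream CDC process could read. Table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE orders_changelog (
    order_id   INTEGER,
    old_status TEXT,
    new_status TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);
-- The trigger does the change capture: every UPDATE also writes an audit row.
CREATE TRIGGER orders_after_update AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changelog (order_id, old_status, new_status)
    VALUES (OLD.id, OLD.status, NEW.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'NEW')")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE id = 1")
print(conn.execute("SELECT * FROM orders_changelog").fetchall())
# [(1, 'NEW', 'SHIPPED', '<timestamp>')]
```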
Popular CDC Tools
Debezium – Log-based CDC for various databases.
AWS DMS – Database Migration Service supporting CDC.
Kafka Connect API – Streams CDC events via Kafka.
Airbyte – Open-source ETL tool supporting CDC.
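As a sketch of how a log-based connector is typically set up with these tools, the snippet below registers a Debezium PostgreSQL connector through Kafka Connect's REST interface (port 8083 by default). Host names, credentials, and table names are placeholders, and exact property names vary between Debezium versions.

```python
import requests  # third-party HTTP client (pip install requests)

# Kafka Connect exposes a REST API for creating and managing connectors.
connect_url = "http://localhost:8083/connectors"

# Property names follow Debezium's PostgreSQL connector documentation;
# all values here are illustrative placeholders.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "shop",
        "topic.prefix": "inventory",
        "table.include.list": "public.orders",
    },
}

response = requests.post(connect_url, json=connector, timeout=10)
response.raise_for_status()
print(response.json())
```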
Choosing the Right Data Ingestion Tool
When selecting a data ingestion tool, you must evaluate both the data characteristics and the tool’s reliability.
Data Type & Structure:
Structured (e.g., relational databases), semi-structured (e.g., JSON, XML), or unstructured (e.g., images).
Ensure your ingestion tool supports transformations for different formats.
Data Volume:
Batch ingestion: Consider the total dataset size and network limitations.
Streaming ingestion: Check maximum message size limits (e.g., Kafka's broker limit is configurable and defaults to roughly 1MB per message, while Amazon Kinesis caps each record at 1MB).
Future Data Growth:
Anticipate how data volume will grow daily, monthly, or yearly.
Choose scalable tools that can handle increasing workloads.
Latency Requirements:
Batch ingestion works for periodic updates (e.g., daily ETL jobs).
Streaming ingestion is needed for real-time processing (e.g., fraud detection).
Data Quality:
Ensure the ingestion tool supports error handling, deduplication, and data validation (a small validation and deduplication sketch appears at the end of this section).
Schema Changes:
Use ingestion tools that auto-detect schema changes if updates are frequent.
Maintain good communication with upstream data producers.
Reliability:
Ensure the ingestion pipeline is fault-tolerant to prevent data loss.
Example: IoT event streams must be ingested before they expire.
Durability:
Avoid data corruption and design for redundancy.
Trade-offs: Weigh cost vs. risk of losing data.
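Referring back to the data-quality point above, the following sketch shows one common way to validate and deduplicate records during ingestion. The required fields and the in-memory seen-ID set are illustrative choices; a production pipeline would typically track processed IDs in a persistent store.

```python
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"event_id", "order_id", "amount"}  # illustrative schema


def clean_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Drop malformed records and duplicates before loading downstream."""
    seen_ids = set()
    for record in records:
        if not REQUIRED_FIELDS.issubset(record):  # validation: required keys present
            continue
        if record["event_id"] in seen_ids:        # deduplication by event id
            continue
        seen_ids.add(record["event_id"])
        yield record


raw = [
    {"event_id": "a1", "order_id": 1, "amount": 9.99},
    {"event_id": "a1", "order_id": 1, "amount": 9.99},  # duplicate
    {"order_id": 2, "amount": 5.00},                    # missing event_id
]
print(list(clean_stream(raw)))  # only the first record survives
```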