Debezium for PostgreSQL to Kafka


    In this article, we discuss the need to segregate the read and write data models and to use event sourcing to capture detailed data changes. These two aspects are critical for data analysis in the big data world. We compare several candidate solutions and conclude that a CDC (change data capture) strategy is a perfect match for the CQRS pattern.

    Context and Problem

    To support business decision-making, we demand fresh and accurate data that is available where and when we need it, often in real time.

    But,

    • when business analysts run analyses against them, the production databases are (or soon will be) overloaded;
    • some process details (the transaction stream) that are valuable for analysis may have been overwritten;
    • OLTP data models may not be friendly to analytical purposes.

    We hope to arrive at an efficient solution that captures the detailed transaction stream and ingests the data into Hadoop for analysis.

    State vs. Stream

    CQRS and Event Sourcing Pattern

    CQRS-based systems use separate read and write data models, each tailored to relevant tasks and often located in physically separate stores.

    Event sourcing: instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data.
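
    As a toy illustration of this pattern (not from the original article; all names here are invented), the current state is never stored directly but recomputed by replaying the append-only log:

```java
// Hypothetical event-sourced account: state is never updated in place;
// it is derived by replaying the append-only event log.
import java.util.ArrayList;
import java.util.List;

public class EventSourcedAccount {
    // Each event records an action taken, not the resulting state.
    record Event(String type, long amount) {}

    private final List<Event> log = new ArrayList<>();

    public void deposit(long amount)  { log.add(new Event("DEPOSIT", amount)); }
    public void withdraw(long amount) { log.add(new Event("WITHDRAW", amount)); }

    // Current state = a fold over the full series of recorded actions.
    public long balance() {
        long balance = 0;
        for (Event e : log) {
            balance += e.type().equals("DEPOSIT") ? e.amount() : -e.amount();
        }
        return balance;
    }

    public static void main(String[] args) {
        EventSourcedAccount acct = new EventSourcedAccount();
        acct.deposit(100);
        acct.withdraw(30);
        System.out.println(acct.balance()); // prints 70; the log still shows both actions
    }
}
```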

    CQRS

    Decoupling: one team of developers can focus on the complex domain model that is part of the write model, while another team focuses on the read model and the user interfaces.

    Ingest Solutions - dual writes

    Dual Write

    • adds complexity to the business system
    • is less fault-tolerant when the backend message queue is blocked or under maintenance
    • suffers from race conditions and consistency problems

    Business log

    • raises data-sensitivity concerns
    • adds complexity to the business system

    Ingest Solutions - database operations

    Snapshot

    • data in the database is constantly changing, so the snapshot is already out of date by the time it is loaded
    • even if you take a snapshot once a day, the downstream system still works with day-old data
    • on a large database those snapshots and bulk loads can become very expensive

    Data offload

    • brings operational complexity
    • cannot meet low-latency requirements
    • cannot handle delete operations

    Ingest Solutions - capture data change

    Process only the “diff” of changes:

    • write all your data to only one primary DB;
    • extract two things from that database (a minimal SQL sketch follows this list):
        • a consistent snapshot, and
        • a real-time stream of changes
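
    In PostgreSQL, both pieces can come from logical decoding. A minimal sketch using the built-in test_decoding output plugin (the slot name cdc_slot is arbitrary):

```sql
-- Creating a logical replication slot pins a consistent point in the WAL:
-- a snapshot taken at that point, plus the slot's change stream, covers all data.
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- Consume the real-time stream of changes recorded since the slot was created.
SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);
```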

    Benefits:

    • decoupled from the business system
    • latency of less than a second
    • the stream preserves the ordering of writes, so there are fewer race conditions
    • the pull strategy is robust to data corruption (the log can be replayed)
    • supports as many different data consumers as required

    Ingest Solutions - wrap-up

    Viewing data applications within the bigger picture of business applications, we will focus on the ‘capture changes to data’ component.

    [Figure: where the ‘capture changes to data’ component sits in the overall architecture]

    Open-Source Options for Postgres to Kafka

    **Sqoop**
    can only take full snapshots of a database; it cannot capture an ongoing stream of changes. Transactional consistency of its snapshots is also not well supported. (Apache License)
    **pg_kafka**
    is a Kafka producer client in a Postgres function, so we could potentially produce to Kafka from a trigger; see the trigger sketch after this list. (MIT License)
    **bottledwater-pg**
    is change data capture (CDC) specifically from PostgreSQL into Kafka. (Apache License 2.0, from Confluent Inc.)
    **debezium-pg**
    is change data capture for a variety of databases. (Apache License 2.0, from Red Hat)
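
    For illustration only: assuming pg_kafka exposes a produce function such as kafka_produce(topic, message) (verify the exact name and signature against the extension you install), a trigger-based producer might look like this:

```sql
-- Hypothetical sketch: forward every insert on "orders" to Kafka via a trigger.
-- kafka_produce(topic, message) is an assumed pg_kafka function name; check
-- the extension's documentation before relying on it.
CREATE OR REPLACE FUNCTION orders_to_kafka() RETURNS trigger AS $$
BEGIN
    PERFORM kafka_produce('orders', row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_to_kafka_trg
    AFTER INSERT ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_to_kafka();
```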

    [Figure: comparison of the four candidate tools]

    Debezium for Postgres is comparatively better.

    Debezium for Postgres Architecture

    debezium/postgres-decoderbufs

    • manually build the output plugin
    • change the PG configuration to preload the library file, then restart the PG service; a sketch of the settings follows
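
    A sketch of the relevant postgresql.conf entries, assuming the decoderbufs plugin has already been built and installed into the Postgres library directory (the numeric limits are illustrative):

```
# postgresql.conf -- enable logical decoding and preload the output plugin
shared_preload_libraries = 'decoderbufs'
wal_level = logical
max_wal_senders = 4          # illustrative values; size for your workload
max_replication_slots = 4
```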

    debezium/debezium

    • compile and package the dependent JAR files

    Kafka Connect

    • deploy a distributed Kafka Connect service
    • start a Debezium connector in Kafka Connect; a registration sketch follows
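
    As an illustration, a connector is registered through Kafka Connect’s REST API. Hostnames, credentials, and names below are placeholders, and the property set should be checked against the Debezium version in use:

```bash
# Register a Debezium Postgres connector with the Kafka Connect REST API
# (port 8083 by default). All values below are placeholders.
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8083/connectors -d '{
    "name": "pg-cdc-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "plugin.name": "decoderbufs",
      "slot.name": "debezium",
      "database.hostname": "pg-host",
      "database.port": "5432",
      "database.user": "replicator",
      "database.password": "secret",
      "database.dbname": "inventory",
      "database.server.name": "pgserver1"
    }
  }'
```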

    HBase connector

    • development work: implement an HBase sink connector for PG CDC events; a sketch follows
    • start the HBase connector in Kafka Connect
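
    A hypothetical sketch of that development work: a Kafka Connect SinkTask that writes each CDC event into an HBase table. The table name, column family, and row-key scheme are illustrative choices, not anything prescribed by Debezium:

```java
import java.util.Collection;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical HBase sink task for CDC events; table/column names are
// illustrative. A full connector also needs a SinkConnector class and config.
public class HBaseCdcSinkTask extends SinkTask {
    private Connection connection;
    private Table table;

    @Override
    public void start(Map<String, String> props) {
        try {
            Configuration conf = HBaseConfiguration.create();
            connection = ConnectionFactory.createConnection(conf);
            table = connection.getTable(TableName.valueOf(props.get("hbase.table")));
        } catch (Exception e) {
            throw new RuntimeException("failed to open HBase connection", e);
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        try {
            for (SinkRecord record : records) {
                // Row key: topic + Kafka offset, so replays overwrite rather than duplicate.
                Put put = new Put(Bytes.toBytes(record.topic() + "-" + record.kafkaOffset()));
                put.addColumn(Bytes.toBytes("cdc"), Bytes.toBytes("event"),
                        Bytes.toBytes(String.valueOf(record.value())));
                table.put(put);
            }
        } catch (Exception e) {
            throw new RuntimeException("failed to write to HBase", e);
        }
    }

    @Override
    public void stop() {
        try {
            if (table != null) table.close();
            if (connection != null) connection.close();
        } catch (Exception ignored) {
        }
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```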

    Spark Streaming

    • development work: implement data-processing functions on top of Spark Streaming; a sketch follows
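
    A minimal sketch of such a job, assuming the spark-streaming-kafka-0-10 integration. The broker address and the Debezium topic name (serverName.schema.table) are placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

// Hypothetical sketch: consume Debezium change events from Kafka and count
// events per source table (topic) in 10-second batches.
public class CdcStreamJob {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("cdc-stream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");   // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "cdc-analytics");

        KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("pgserver1.public.orders"), kafkaParams))
                .mapToPair(record -> new Tuple2<>(record.topic(), 1L))
                .reduceByKey((a, b) -> a + b)
                .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```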

    [Figure: Debezium-based pipeline from Postgres through Kafka to HBase and Spark Streaming]

    Considerations

    Reliability

    For example:

    • detect data source exceptions or source relocation, and automatically or manually restart data-capture tasks or redirect the data source;
    • monitor data quality and latency;

    Scalability

    • monitor data source load pressure, and automatically or manually scale out data-capture tasks;

    Maintainability

    • GUI for system monitoring, data-quality checks, latency statistics, etc.;
    • GUI for configuring data-capture task scale-out

    Other CDC solutions

    Databus (LinkedIn): no native support for PG
    Wormhole (Facebook): not open source
    Sherpa (Yahoo!): not open source
    BottledWater (Confluent): Postgres only (no longer maintained!)
    Maxwell: MySQL only
    Debezium (Red Hat): good
    Mongoriver: MongoDB only
    GoldenGate (Oracle): for Oracle and MySQL; free but not open source
    Canal & Otter (Alibaba): MySQL replication
