How change data capture technology drives modern data architectures
(This article, the second in a three-part series, was excerpted from “Streaming Change Data Capture: A Foundation for Modern Data Analytics.” The book was published in May by O’Reilly Media and is available now for download.)
Change data capture software enables continuous incremental replication by identifying and copying data updates as they take place. When designed and implemented effectively, CDC can meet today’s scalability, efficiency, real-time and zero-impact requirements. Without CDC, organizations usually fail to meet modern analytics requirements.
The diagram below illustrates the architectures in which data workflow and analysis take place and their integration points with CDC. The methods and terminology for data transfer tend to vary by target. While the transfer of data and metadata into a database involves simple replication, a more complicated extract, transform and load (ETL) process is required for data warehouse targets.
Data lakes, meanwhile, typically can ingest data in its native format. Finally, streaming targets require source data to be published and consumed in a series of messages. Any of these four target types can reside on-premises, in the cloud, or in a hybrid combination of the two.
In practice, most enterprises have a patchwork of the architectures described here, as they apply different engines to different workloads. A trial-and-error learning process, changing business requirements, and the rise of new platforms all mean that data managers will need to keep copying data from one place to another. Data mobility will be critical to the success of modern enterprise IT departments for the foreseeable future.
The following case studies illustrate the role of CDC in enabling scalable and efficient analytics architectures that do not affect production application performance.
By moving and processing incremental data and metadata updates in real time, these organizations have reduced or eliminated the need for resource-draining and disruptive batch (a.k.a. full) loads. They are siphoning data to multiple platforms for specialized analysis on each, and consuming CPU and other resources in a balanced and sustainable way.
Case Study 1: Streaming to a Cloud-Based Lambda Architecture
First let’s look at a Fortune 500 healthcare solution provider to hospitals, pharmacies, clinical laboratories, and doctors that is investing in cloud analytics to identify opportunities for improving quality of care.
The analytics team for this company, which we’ll call “GetWell,” is using CDC software to accelerate and streamline clinical data consolidation from on-premises sources such as SQL Server and Oracle to a Kafka message queue that in turn feeds a Lambda architecture on Amazon Web Services (AWS) Simple Storage Service (S3). Log-based CDC has enabled them to integrate this clinical data at scale from many sources with minimal administrative burden and no impact on production operations.
GetWell data scientists conduct therapy research on this Lambda architecture, using both historical batch processing and real-time analytics. In addition to traditional SQL-structured analysis, they run graph analysis to better assess the relationships between clinical drug treatments, drug usage, and outcomes. They also perform natural-language processing (NLP) to identify key observations within physician’s notes and are testing other new AI approaches such as machine learning to improve predictions of clinical treatment outcomes.
Case Study 2: Streaming to the Data Lake
Decision makers at an international food industry leader, which we’ll call “Suppertime,” needed a current view and continuous integration of production capacity data, customer orders and purchase orders to efficiently process, distribute and sell tens of millions of chickens each week. But Suppertime struggled to bring together these large datasets, which were distributed across several acquisition-related silos in SAP enterprise resource planning (ERP) applications. Using nightly batch replication, they were unable to match orders and production line-item data fast enough. This slowed plant operational scheduling and prevented sales teams from filing accurate daily reports.
To streamline the process, Suppertime converted to a new Hadoop data lake based on the Hortonworks data platform and CDC. It now uses CDC to efficiently copy all of the SAP record changes every five seconds, decoding that data from complex source SAP pool and cluster tables. CDC injects this data stream, along with any changes to the source metadata and data definition language (DDL) changes, to a Kafka message queue that feeds HDFS and HBase consumers that subscribe to the relevant message topics (one topic per source table).
After the data arrives in HDFS and HBase, Spark in-memory processing helps match orders to production on a real-time basis and maintain referential integrity for purchase order tables. As a result, Suppertime has accelerated sales and product delivery with accurate real-time operational reporting. It has replaced batch loads with CDC to operate more efficiently and more profitably.
Case Study 3: Streaming, Data Lake, and Cloud Architecture
Another example is a U.S. private equity and venture capital firm that built a data lake to consolidate and analyze operational metrics from its portfolio companies. This firm, which we’ll call “Startup Backers,” opted to host its data lake in the Microsoft Azure cloud rather than taking on the administrative burden of an on-premises infrastructure.
CDC is remotely capturing updates and DDL changes from source databases (Oracle, SQL Server, MySQL, and DB2) at four locations in the United States. CDC then sends that data through an encrypted File Channel connection over a wide area network (WAN) to CDC in the Azure cloud. This replication instance publishes the data updates to a Kafka message broker that relays those messages in the JSON format to Spark.
The Spark platform prepares the data in micro-batches to be consumed by the HDInsight data lake, SQL data warehouse, and various other internal and external subscribers. These targets subscribe to topics that are categorized by source tables. With this CDC-based architecture, Startup Backers is now efficiently supporting real-time analysis without affecting production operations.
In the final article of this series, I’ll examine how other implementations of CDC technology are empowering enterprises to capture new data and analytics value.