(This article, the first in a three-part series, was excerpted from Streaming Change Data Capture: A Foundation for Modern Data Analytics. This book was published in May by O’Reilly Media and is available now for download.)
Data is driving massive waves of change and giving rise to a new data-driven economy that is still in its infancy. Organizations in all industries are shifting their business models to monetize data, understanding that doing so is critical to competition and even survival. The opportunity is tremendous: applications, instrumented devices and web traffic are generating torrents of data rich in analytics potential.
These analytics initiatives can reshape sales, operations and strategy on many fronts. Real-time processing of customer data can create new revenue opportunities. Tracking devices with Internet of Things (IoT) sensors can improve operational efficiency, reduce risk and yield new analytics insights. New artificial intelligence (AI) approaches such as machine learning can accelerate and improve the accuracy of business predictions. Such is the promise of modern analytics.
However, these opportunities change how data needs to be moved, stored, processed and analyzed, and it’s easy to underestimate the resulting organizational and technical challenges. From a technology perspective, to achieve the promise of analytics, underlying data architectures need to efficiently process high volumes of fast-moving data from many sources. They also need to accommodate evolving business needs and multiplying data sources.
To adapt, IT organizations are embracing data lake, streaming and cloud architectures. These platforms are complementing and even replacing the enterprise data warehouse (EDW), the traditional structured system of record for analytics.
Enterprise architects and other data managers know firsthand that we are in the early phases of this transition, and it is tricky stuff. A primary challenge is data integration—the second most likely barrier to Hadoop data lake implementations, right behind data governance, according to a recent TDWI survey, “Data Lakes: Purposes, Practices, Patterns and Platforms.”
IT organizations must copy data to analytics platforms, often continuously, without disrupting production applications (a trait known as zero-impact). Data integration processes must be scalable, efficient and able to absorb high data volumes from many sources without a prohibitive increase in labor or complexity. The table below summarizes the key data integration requirements of modern analytics initiatives.
All this entails careful planning and new technologies because traditional batch-oriented data integration tools do not meet these requirements. Batch replication jobs and manual extract, transform and load (ETL) scripting procedures are slow, inefficient and disruptive. They interrupt production, tie up talented ETL programmers and create network and processing bottlenecks. They cannot scale sufficiently to support strategic enterprise initiatives. Batch is unsustainable in today’s enterprise.
The Alternative to Batch: Change Data Capture Technology
A foundational technology for modernizing your environment is change data capture (CDC) software, which enables continuous incremental replication by identifying and copying data updates as they take place. When designed and implemented effectively, CDC can meet today’s scalability, efficiency, real-time and zero-impact requirements.
Without CDC, organizations usually fail to meet modern analytics requirements. They must stop or slow production activities for batch runs, hurting efficiency and agility. They cannot integrate enough data, fast enough, to meet analytics objectives. As a result, they miss business opportunities, lose customers and break operational budgets.
CDC continuously identifies and captures incremental changes to data and data structures (a.k.a. schemas) from a source such as a production database. CDC arose two decades ago to help replication software deliver real-time transactions to data warehouses, where the data is then transformed and delivered to analytics applications. Thus, CDC enables efficient, low-latency data transfer to operational and analytics users with low production impact.
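The core idea can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: instead of reloading a whole table, a replicator consumes an ordered stream of change events (much as a log-based CDC tool reads a database transaction log) and applies each event to the target copy. The event shape and field names here are invented for the example.

```python
# Toy sketch of log-based CDC: apply an ordered stream of incremental
# change events to a target copy, instead of reloading the full table.
target = {}  # target table, keyed by primary key

def apply_change(event):
    """Apply one incremental change event to the target copy."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        target[key] = row          # upsert the new row image
    elif op == "delete":
        target.pop(key, None)      # remove the deleted row

# A stand-in for what a CDC reader would emit from the source's log.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada",  "status": "active"}},
    {"op": "insert", "key": 2, "row": {"name": "Alan", "status": "active"}},
    {"op": "update", "key": 2, "row": {"name": "Alan", "status": "inactive"}},
    {"op": "delete", "key": 1},
]

for event in change_log:
    apply_change(event)

print(target)  # {2: {'name': 'Alan', 'status': 'inactive'}}
```

Only the four change events cross the wire, yet the target ends up consistent with the source, which is precisely why CDC keeps latency and production impact low.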
The first method used for replicating production records (i.e., rows in a database table) to an analytics platform is batch loading, also known as bulk or full loading. This process creates files or tables at the target, defines their “metadata” structures based on the source, and then populates them with a complete copy of the source data.
Batch loads and periodic reloads with the latest data take time and often consume significant processing power on the source system. This means administrators need to run replication loads during “batch windows” of time in which production is paused or will not be heavily affected. Batch windows are increasingly unacceptable in today’s global, 24×7 business environment.
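For contrast, here is a minimal sketch of the full-load approach using SQLite from Python's standard library. The table and column names are invented for illustration; the point is that every reload recreates the target structure and recopies every row, which is the work that must fit inside a batch window.

```python
import sqlite3

# Hypothetical source and target databases for the example.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# A production table standing in for the source system.
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 19.99), (2, 5.00), (3, 42.50)])
source.commit()

def full_load(src, tgt, table):
    """Recreate `table` on the target from the source schema, then copy all rows."""
    # Read the source's CREATE statement (its "metadata" structure).
    ddl = src.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()[0]
    tgt.execute(f"DROP TABLE IF EXISTS {table}")
    tgt.execute(ddl)
    # Copy every row -- the step that ties up source I/O during a batch window.
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    placeholders = ",".join("?" * len(rows[0]))
    tgt.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    tgt.commit()

full_load(source, target, "orders")
print(target.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 3
```

Even in this tiny sketch, refreshing the target means rereading and rewriting all three rows; at production scale, that full scan on every refresh is what makes batch windows necessary.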
CDC has three fundamental advantages over batch replication:
- It enables faster and more accurate decisions based on the most current data; for example, by feeding database transactions to streaming analytics applications.
- It minimizes disruptions to production workloads.
- It reduces the cost of transferring data over the wide area network (WAN) by sending only incremental changes.
Together these advantages enable IT organizations to meet the real-time, efficiency, scalability, and low-production impact requirements of a modern data architecture.
In the next article in this series, I’ll examine how CDC is empowering enterprises to capture new data and analytics value.