The Evolution of ETL

Data is now generated and collected faster than ever before. In the past, most data was generated by humans typing into application forms, ringing up purchases at point-of-sale machines and so on. Now, the majority of data is machine generated, collected in application logs and produced by sensors. The verbosity and sampling rate of these sources have exploded as computing capacity has expanded, storage has become cheaper and the business value of this data has increased. To meet these challenges, a new breed of platforms has been developed, including Hadoop, a wide range of NoSQL stores and cloud-enabled infrastructure.

As experience tells us, just putting the data somewhere is not the goal. Extracting, transforming and loading this data into an analytic system is what makes the data we have collected useful. For this purpose, we have our trusted workhorse, the ETL platform. But it, too, must evolve in order to serve in this challenging new world.

Evolution 1: Pushing the Processing Down

With data sets becoming larger and more complex, it is increasingly important that processing remain close to the data. Keeping it there helps ensure a scalable, divisible processing model and reduces the bottlenecks of network latency. The model of picking up data, processing it on another server and then putting it back down must be used with greater care because it is too costly in a high-volume environment. Hadoop and MapReduce have thoroughly proven the performance of local data processing. We will see more ETL tools leveraging Hadoop to perform their processing, and several mainstream ETL tools are already executing on this strategy.
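The idea of processing close to the data can be illustrated with a minimal sketch. It is not Hadoop code; the `partitions` list, the record layout and the function names are all assumptions made up for illustration. Each partition is summarized where it lives, and only the small summaries are moved and merged.

```python
from collections import Counter

# Hypothetical partitions: each list stands in for log records stored on one node.
partitions = [
    [{"status": 200}, {"status": 500}, {"status": 200}],
    [{"status": 200}, {"status": 404}],
]

def local_aggregate(records):
    """Runs where the data lives; returns a small summary instead of raw rows."""
    return Counter(r["status"] for r in records)

def combine(partials):
    """Runs centrally; merges the per-partition summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Only the per-partition counters would cross the network, not the raw records.
partials = [local_aggregate(p) for p in partitions]
status_counts = combine(partials)
```

In a real cluster the local step would run on the node holding each partition. The point of the sketch is only that the volume shipped between servers is proportional to the size of the summaries, not the size of the raw data.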

Evolution 2: Catching the Data In-Stream

Not only have data sets become larger, but the need to gain insights from them more quickly has increased. End-of-day batch processing has become less desirable, and in many cases it will not support an organization’s critical decision support functions.

This need to gain insights from data in near-real time has given rise to a new breed of processing engines such as Storm, Akka and Esper. These engines perform functions very similar to those of ETL platforms; however, they more often interact with message queues than with databases, and they process continuously rather than in specific batch windows. In a sophisticated application infrastructure with a service-oriented architecture and a robust message queue, we can consume this data in real time without waiting for it to land in a target system for batch processing later. The analytics system now becomes a more integrated part of the strategic application architecture rather than merely a downstream consumer.
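The contrast with batch windows can be sketched in a few lines. The example below uses Python's standard `queue` module as a stand-in for a message queue; it is not the API of Storm, Akka or Esper, and the event fields and function names are hypothetical. Each message is transformed and folded into a running result as soon as it arrives.

```python
import queue

def transform(event):
    """Normalize one raw event as soon as it arrives."""
    return {"sensor": event["sensor"].lower(), "value": float(event["value"])}

def consume(messages, totals):
    """Drain the queue, updating running totals per sensor after every message."""
    while True:
        event = messages.get()
        if event is None:  # sentinel used here only to end the demo
            break
        row = transform(event)
        totals[row["sensor"]] = totals.get(row["sensor"], 0.0) + row["value"]
    return totals

# Hypothetical events; in practice a producer would publish these continuously.
messages = queue.Queue()
for raw in [
    {"sensor": "A1", "value": "2.5"},
    {"sensor": "a1", "value": "1.5"},
    {"sensor": "B2", "value": "4"},
]:
    messages.put(raw)
messages.put(None)

running_totals = consume(messages, {})
```

The totals are current after every message, so a downstream consumer never has to wait for an end-of-day load. A production engine would add what this sketch leaves out, such as parallelism, delivery guarantees and windowing.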

We will see these engines continue to evolve and gain popularity. Additionally, mainstream ETL tools will begin improving their real-time integrations, perhaps integrating with these engines.

Evolution 3: New Databases

One of the most exciting movements in big data has been NoSQL. Relational databases have served us well for years and will continue to have a place in our analytics architecture. However, there is now a diverse landscape of NoSQL databases that can better suit our needs, especially related to handling extreme data volumes.

ETL tools have a long history of developing adapters to interact with just about any system out there, so building integrations to popular NoSQL stores is a natural extension. However, this exercise will not be trivial. Most ETL tools still think relationally, in terms of normalizing to rows and columns, so the overall processing paradigm will need to evolve to take full advantage of the features of these databases. For example, the composite column in Cassandra or the link in Riak is fairly alien from a relational standpoint, and ETL tools will need to understand how such structures should be sourced, mapped to data pipeline operations and targeted. Just as important, the distributed processing and MapReduce capabilities of these databases will need to be integrated as well.
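To make the mapping problem concrete, here is a simplified sketch of one wide row whose column names are composites, flattened into relational-style records. It models the data as plain Python dictionaries rather than using any Cassandra driver, and the key, column components and values are hypothetical.

```python
# Hypothetical wide row: one row key, with columns named by (day, metric) tuples.
wide_row = {
    "key": "sensor-42",
    "columns": {
        ("2013-01-01", "max"): 71.2,
        ("2013-01-01", "min"): 55.0,
        ("2013-01-02", "max"): 69.8,
    },
}

def flatten(row):
    """Turn each composite column into one relational-style record."""
    return [
        {"sensor": row["key"], "day": day, "metric": metric, "value": value}
        for (day, metric), value in sorted(row["columns"].items())
    ]

records = flatten(wide_row)
```

One source row becomes three target records here, and the column name itself carries data. A row-and-column pipeline has to treat both facts as part of the mapping, which is the kind of shift in paradigm described above.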

Evolution 4: Cloud

The growth of cloud-hosted infrastructure over the last couple of years has been amazing. At long last, the capex/opex objection is being overcome by the clear advantages of elastic computing. Historically, ETL tools have been one of the most burdensome components of analytic architecture from an infrastructure standpoint. Removing these infrastructure and scalability concerns from the equation will allow developers and business users to focus on how to leverage data instead of on the operational details of how the platform gets built. This is especially important when data is growing rapidly and unpredictably. We need to be able to adapt easily to an order-of-magnitude change in scale.

ETL tools will need to adapt to this environment by excising the last bits of their legacy heritage and fully embracing modern application architecture. Seamless provisioning, elastic scalability, flexible and simple partitioning schemes, shared state and distributed caches will all need to move to the front of product roadmaps. Many major ETL vendors are moving in this direction.

It is an exciting time for the ETL market. Open source, cloud architecture and distributed processing will all be front and center in product strategy. In the coming years, we will finally see some exciting innovation in a space that has been relatively quiet for the past decade.
