It may be very tempting for an organization to think that it can and should build a data lake from scratch using its own people and skills, for relatively little money and in just a couple of weeks. After all, Hadoop and its associated open source projects are free, and the cost to stand up a Hadoop cluster on commodity hardware is small (at least by corporate IT spending standards). How hard can it be to load data into Hadoop and give users some tools to start wrangling?

The reality, however, is that data lake projects aren’t free, even when built on Hadoop and free open source software. Hadoop developers are notoriously expensive to hire and experienced ones are difficult to find. Delays in development efforts not only increase direct project personnel costs, but also postpone the sunsetting of other costly enterprise data management and infrastructure applications and platforms.

Harder to measure is the financial impact on the business as delivery of the data lake falls farther and farther behind schedule and data analysts and business users are left to make do with inadequate data for their decisions.

“Free and Easy” Meets the Real World: Getting Your Data Business Ready

The idea that building a data lake from scratch should be easy and relatively inexpensive starts to crumble when the actual work of standing it up begins. That first stage in the data lake project – onboarding data into the lake – offers relevant lessons on why internal data lake projects often take significantly longer than expected and cost a lot more than initially planned.

Getting your Data “BUSINESS READY”:

Step 1 – Ingest the data into the lake

The initial thinking is that you simply “copy” data into HDFS with tools such as Apache Sqoop. What you quickly realize is that landing these relatively simple RDBMS sources in HDFS, once “Sqooped,” only solves part of the problem. What about history management, for example?
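
To make the “only part of the problem” point concrete, here is a minimal, hedged sketch of what teams often end up scripting around a plain Sqoop import: each run is landed in a dated HDFS directory so prior snapshots survive and some basic history is retained. The JDBC URL, credentials file, table name and directory layout are illustrative assumptions, not a prescribed design.

```python
# Illustrative sketch only: wrap "sqoop import" so each run lands in a dated
# HDFS directory, keeping prior snapshots as a crude form of history management.
# The connection details and directory layout below are assumptions.
import subprocess
from datetime import date

def ingest_with_history(jdbc_url, username, password_file, table, hdfs_base):
    """Run a Sqoop import into a per-day target directory instead of overwriting."""
    target_dir = f"{hdfs_base}/{table}/load_date={date.today().isoformat()}"
    subprocess.run([
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,   # new snapshot each run; earlier ones remain
        "--num-mappers", "4",
    ], check=True)
    return target_dir

# Hypothetical usage:
# ingest_with_history("jdbc:oracle:thin:@dbhost:1521/ORCL", "etl_user",
#                     "/user/etl/.password", "CUSTOMERS", "/data/raw")
```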

You then try to tackle complex sources such as XML or COBOL-based mainframe data sets and quickly realize they are nearly impossible to get right. There are no off-the-shelf open source projects that handle these, so you must write custom code. Whether you build on projects like Apache NiFi or Kafka, you will need to spend significant time coding character set conversion or XML normalization if you want business users to be able to use this data.
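
To give a feel for what that custom coding involves, the sketch below shows the two conversions named above in their simplest possible form: decoding an EBCDIC (mainframe code page 037) record into Unicode, and flattening a nested XML document into row-shaped records. It is illustrative only; real COBOL copybooks add packed decimals, REDEFINES and OCCURS clauses, and the XML element names here are invented.

```python
# Illustrative only: the simplest form of the two conversions mentioned above.
# Real mainframe files also need copybook parsing (COMP-3 fields, REDEFINES,
# OCCURS); real XML feeds need schema-driven mapping, not hard-coded names.
import xml.etree.ElementTree as ET

def ebcdic_to_text(raw_bytes: bytes) -> str:
    """Decode an EBCDIC (code page 037) record into a Unicode string."""
    return raw_bytes.decode("cp037")

def normalize_orders(xml_text: str) -> list:
    """Flatten nested <order>/<line> elements into flat, row-shaped records."""
    rows = []
    for order in ET.fromstring(xml_text).findall("order"):
        for line in order.findall("line"):
            rows.append({
                "order_id": order.get("id"),
                "customer": order.findtext("customer"),
                "sku": line.findtext("sku"),
                "qty": line.findtext("qty"),
            })
    return rows

print(ebcdic_to_text(b"\xc8\xc5\xd3\xd3\xd6"))   # EBCDIC bytes for "HELLO"
```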

Step 2 – Validate and profile each new source

In order to deem this newly sourced data “business ready,” you must provide some level of validation. If, for example, you have data with embedded delimiters, your business users will get incorrect results when using that data. Scarier still, they won’t even know they are getting wrong answers.
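
As a concrete illustration of the embedded-delimiter trap, the short sketch below compares a naive split with a proper CSV parse and flags any row whose field count doesn’t match the header; the sample record is made up, and a real pipeline would quarantine the failures rather than just print them.

```python
# Illustrative check for the embedded-delimiter problem described above.
import csv, io

sample = 'id,name,city\n1,"Acme, Inc.",Boston\n2,Widgets LLC,Chicago'

naive = [line.split(",") for line in sample.splitlines()]
parsed = list(csv.reader(io.StringIO(sample)))

expected = len(parsed[0])
print([len(r) for r in naive])    # [3, 4, 3] -> the quoted record gains a phantom column
print([len(r) for r in parsed])   # [3, 3, 3] -> quoted comma handled correctly

# A minimal validation rule: flag rows with the wrong number of fields.
bad_rows = [i for i, r in enumerate(parsed[1:], start=2) if len(r) != expected]
print("rows failing width check:", bad_rows)
```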

Along with some level of validation, capturing even basic data profiling statistics will enable users to quickly find what they need.
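
“Basic data profiling statistics” here might mean, per column, a row count, null count, distinct count and min/max, roughly as in the sketch below (pure Python for readability; at data lake scale this work would typically run in Hive, Spark or a similar engine).

```python
# A minimal per-column profile: counts, nulls, distinct values, min/max.
# Pure Python for clarity; at data-lake scale this would run in Hive or Spark.
def profile(rows, columns):
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
        }
    return stats

rows = [{"id": "1", "city": "Boston"}, {"id": "2", "city": ""}, {"id": "3", "city": "Chicago"}]
print(profile(rows, ["id", "city"]))
```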

Again, just validating and profiling data as it is sourced can be a large, drawn-out project in its own right, and one fraught with risk.

Step 3 – Automate

Finally, the data onboarding processes developed for each data source need to be adapted to refresh the data lake with new data from each source on a periodic basis, supporting either a change data capture or a total refresh model. These “production-ready” onboarding processes then need to be automated so that they can be repeated in a lights-out production environment.
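
As a rough sketch of the two refresh models, the code below keeps a high-water mark per source and either pulls only rows changed since the last run (the change-data-capture style) or re-extracts everything (total refresh). The state file, function signatures and lake-write step are assumptions; a production, lights-out version would also need logging, alerting and restartability.

```python
# Sketch of the two refresh models mentioned above. The state file and
# function signatures are assumptions; production jobs also need logging,
# alerting and restart logic to run lights-out.
import json, os

STATE_FILE = "watermarks.json"   # last successfully loaded point per source

def load_state():
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def refresh(source, extract_fn, mode="incremental"):
    """extract_fn(since) returns (rows, new_high_water_mark)."""
    state = load_state()
    since = None if mode == "full" else state.get(source)  # full = re-pull everything
    rows, high_water_mark = extract_fn(since)
    write_to_lake(source, rows)                 # append or overwrite, per source
    state[source] = high_water_mark
    save_state(state)

def write_to_lake(source, rows):
    print(f"wrote {len(rows)} rows for {source}")  # placeholder for the HDFS write
```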

Given the complexity and scale of these three data onboarding steps, it might take a skilled Hadoop programmer using hand-coded load jobs 4 months to load each source. That’s a lot of effort and a lot of time just to get the data lake populated and ready for business.

Build vs. Buy: What Is the Fastest, Cheapest Way to a Data Lake?

What, then, are the time and cost tradeoffs of building a data lake from scratch using internal IT resources versus using a commercial product as the platform for a data lake? Consider, as a representative scenario, a lake that must onboard 100 source systems.

One option would be to onboard the data using custom code written by Hadoop programmers for that specific data lake project. This option would require 33 FTEs (full-time employees) to work for a full year to onboard all 100 sources into the data lake, at a total cost of nearly $5.8 million.

A second option would be to use a Hadoop-friendly ETL tool to support the ingest process. This would lower the number of FTEs needed to get the job done to 17 and allow the use of less expensive ETL specialists. However, the process would still take most of a year, and the added cost of the ETL product’s software license would bring the total for this option to $2.69 million.

A third option would be to use a commercially available data lake management software product. This would add a license cost for the data lake management software itself, but it would dramatically reduce the time required to onboard each data source to roughly one week.

It would also allow the company to use lower-cost staff, such as data analysts, to onboard each source. In this scenario, all 100 sources could be loaded in 3 months by 8 FTEs, at a total cost of $550,000.
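
The arithmetic behind these figures is easy to reproduce: each option’s annual total is the FTE count times the loaded annual rate times the fraction of the year worked, plus any annual license fee. The sketch below recomputes the three totals summarized in the table that follows.

```python
# Recomputes the totals in the table below: FTEs x loaded annual rate x
# fraction of a year worked, plus any annual license fee.
def annual_cost(ftes, rate_per_year, months, license_fee=0):
    return ftes * rate_per_year * (months / 12) + license_fee

options = {
    "Pure open source":        annual_cost(33, 175_000, 12),
    "Open source w/ ETL tool": annual_cost(17, 150_000, 12, license_fee=140_000),
    "Data lake mgmt platform": annual_cost(8, 150_000, 3, license_fee=250_000),
}
for name, cost in options.items():
    print(f"{name}: ${cost:,.0f}")
# Pure open source: $5,775,000
# Open source w/ ETL tool: $2,690,000
# Data lake mgmt platform: $550,000
```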

TIME AND COST TO ONBOARD COMPLEX DATA INTO A DATA LAKE

| Time and Cost to Onboard a Data Source | BUILD: Pure Open Source (Hadoop + projects) | BUILD: Open Source w/ Tools (Informatica, Talend) | BUY: Data Lake Management Platform |
|---|---|---|---|
| 1 source system | 4 months by 1 FTE @ $175K/yr = $58K | 2 months by 1 FTE @ $150K/yr = $30K | 1 week by 1 FTE @ $150K/yr = $3K |
| 100 source systems | 12 months by 33 FTE @ $175K/yr = $5,775K | 12 months by 17 FTE @ $150K/yr = $2,550K | 3 months by 8 FTE @ $150K/yr = $300K |
| Plus annual license | $0 | $140K | $250K |
| TCO (annual) | $5,775K | $2,690K | $550K |

The BUILD options for deploying a data lake not only cost 5 to 10 times more than the BUY approach, they also take much longer. And those differences are likely to persist over time, as the organization evolves and new requirements emerge to add additional data sources to the lake. This isn’t a one-and-done effort; onboarding goes on forever.

There is a reason most companies buy rather than build many of the enterprise-scale business applications and data management platforms they use to run their business. Building best-in-class solutions to enterprise-scale problems is hard, expensive and time-consuming. Companies like IBM, Oracle, SAP and others have been working on this for decades and they still have lots of work to do.

Sometimes it’s better to buy a packaged platform that already incorporates the functionality you need, captures best practices and can be up and running to deliver value to the business in a few weeks. And perhaps this is nowhere more true than with data lake projects where, as it turns out, free is really not free … and easy is a lot harder than you think.

(About the author: Dr. Paul Barth has spent decades developing advanced data and analytics solutions and is founder and chief executive officer at Podium Data.)