Data lakes are reaching adolescence. The wild emotional years and peer pressure aren’t quite over yet. Many mainstream organizations are still having enthusiastic growing pains that end in embarrassment. Channeling Hadoop and data lake enthusiasm in the right direction is a huge CIO challenge. Here are a few paths toward data lake maturity.
Adolescent enthusiasm is no way to run a business. Every data lake project must have a line of business champion and goal. Just like every other project. Projects without a funding business sponsor tend to fail. Improved customer experience, cost efficiency, and new business opportunities should top the list of data lake projects.
Think of the ROI plan as guardrails on a mountain road, keeping projects from going over the edge. Million dollar projects that the business people refuse to touch are career limiting. And for the first few projects, skip "cost efficiency" justifications that only benefit IT. Those projects are often false growth spurts.
One pitfall that data lake enthusiasts fall into is "put everything in the data lake." We know one worldwide corporate giant that ran up colossal expense this way. Hundreds of times a day they store a terabyte file in the lake. Hadoop then replicates that file twice more for availability. Then they derive seven files from the first. That's eight terabytes of data, each tripled by replication: 24 terabytes per original file. Multiply that by dozens of files daily. Soon the data lake is twenty petabytes and a thousand servers.
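The arithmetic above can be sketched in a few lines. The per-day file count below is an assumed figure ("dozens" in the anecdote); the replication factor of three is the HDFS default.

```python
# Back-of-the-envelope storage math for the scenario above.
# Assumptions: 1 TB source file, 7 derived files, HDFS default
# replication factor of 3, and an assumed 36 such files per day.
TB = 1
source_files = 1
derived_files = 7
replication = 3          # HDFS keeps three copies of every block
files_per_day = 36       # "dozens" -- an assumed figure

per_file = (source_files + derived_files) * TB * replication
daily = per_file * files_per_day
print(f"{per_file} TB per source file, {daily} TB per day")
# -> 24 TB per source file, 864 TB per day
```

At roughly 864 TB a day under these assumptions, the lake crosses the twenty-petabyte mark in under a month.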
Nope, disk storage is not free, especially in the data lake. Start here: every file placed in the data lake must be a line of business necessity. Avoid polluting the lake. Next, minimize derivatives. That means programmers must coordinate designs. That's common sense project management. Optimizing spending from the beginning is easier and cheaper than cleaning up a swamp.
A first principle of data lakes is to capture the original raw data files. Raw files mean the data will have flaws, inconsistencies, and missing values. But dirty data begets muddy answers. Broken data begets broken answers. Refuse to clean up the data and the business users will refuse to use it. Thus data quality processing is not optional.
Raw data needs refining. This is where data warehouse people and tools are a huge accelerator. But data lakes are not the data warehouse, so they don't need the same extensive data cleaning. Data that's less than perfect still has great value. If you can get that value at lower expense, do it. Balancing data trust against investment is sensible data governance. So spend what's necessary and no more. Spoiler: programmers call this schema-on-read.
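Schema-on-read, minimally sketched: the raw file is stored untouched, and structure (field names, types, handling of gaps) is applied only when a consumer reads it. The field names and sample data below are assumptions for illustration, not from the article.

```python
import csv
import io

# Raw file content stored as-is in the lake (assumed sample data).
RAW = "ord-1,2016-03-01, 49.90\nord-2,,19.5\n"

def read_orders(raw_text):
    """Apply a schema at read time; tolerate missing values
    instead of cleaning the stored file."""
    for row in csv.reader(io.StringIO(raw_text)):
        order_id, date, amount = row
        yield {
            "order_id": order_id.strip(),
            "date": date.strip() or None,      # keep the gap visible
            "amount": float(amount) if amount.strip() else None,
        }

orders = list(read_orders(RAW))
```

The stored bytes never change; two different consumers could read the same raw file with two different schemas, which is the cost-saving flexibility the paragraph above is pointing at.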
Some insane parents teach their children to swim by throwing them in the lake. "Sink or swim," they holler. Those parents end up jumping in to save them. Corollary: CIOs spend tons of money rescuing data lakes. One last piece of advice is obvious: hire a guide, a Sherpa, for the first few data lake projects.
Bring in vetted data lake experts to help IT build towards the right goals and foundations. It often takes an outsider to drive ROI common sense among rampant enthusiasm. Experts ensure the architectural foundation holds up over time. If your Sherpa says “the data lake can do everything” or “we can replace the data warehouse,” fire them. Those consultants don’t have the promised experience. Best practices are still emerging. Data lake services people I know are pushing data lake maturity forward. These consultants have done enough projects to know the golden path to positive outcomes. They often bring years of data governance and security experience as well.
Software vendors are struggling to repurpose their data management tools for the data lake. It will take a few more major releases before the proprietary software is mature for data lakes.
Open source software lags proprietary tools by as much as a decade.
It’s easier to add new features to proprietary software than to reinvent decades of experience in open source. But I still hold onto my enthusiasm for open source.
Visionary ideas don’t spring to life fully mature. They evolve in spurts, missteps, and innovations. Channel all that data lake enthusiasm towards a strong, mature, innovative foundation.
(About the author: Daniel Graham is general manager, enterprise systems, at Teradata Corporation)