How to keep from drowning in your data lake
The use of semi-structured data in big data analysis is growing, spurred by the growth of markets like the Internet of Things. However, the increased popularity of data lakes like Hadoop and non-relational databases like MongoDB has exposed a unique challenge to companies: how to make it easy to analyze incredible amounts of non-relational data.
In order to extract value from this type of data, traditional analytics or business intelligence (BI) tools require it to be fully homogenized and fixed into a schema, whose structure depends on the way that data engineers expect it to be queried. This is a long and expensive process. Reports in the media and my own interactions with IT teams and data scientists tell me that as much as 80 percent of the work involved in big data analysis is prepping and maintaining data models.
So how did we get here? The increased use of non-relational stores is primarily due to the flexibility they deliver in comparison to relational databases. With non-relational applications generating terabytes of data, it’s becoming increasingly difficult to manage and catalogue that data in relational databases.
That said, relational databases (Oracle, MySQL, etc.) have been the database platforms of choice among enterprises for decades. Users are experienced with the analytics tools (Tableau, PowerBI, etc.) used with these databases, and whether it’s driven by the comfort of using a familiar tool or by the “golden handcuffs” of software licensing, they want to use those legacy BI tools with their non-relational databases.
The problem is that these tools were never built to work with the semi-structured data formats, like JSON and XML, that power today’s non-relational stores. But IT teams are determined to make them work, leading to a variety of strategies to “prep” non-relational data for ingestion into a relational database for analysis by a legacy BI tool.
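To make that "prep" step concrete, here is a minimal sketch of one common strategy: flattening a nested JSON document into flat rows that a relational table could hold. Everything in it is an assumption made for illustration, including the sample sensor document, its field names, the dotted column naming, and the helper functions `flatten` and `to_rows`. It is not a description of any particular vendor's pipeline.

```python
import json


def flatten(doc, prefix=""):
    """Collapse nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rows(doc, list_field):
    """Emit one flat row per element of doc[list_field], repeating the parent fields."""
    parent = flatten({k: v for k, v in doc.items() if k != list_field})
    rows = []
    for item in doc.get(list_field, []):
        row = dict(parent)
        row.update(flatten(item, prefix=f"{list_field}."))
        rows.append(row)
    return rows


# Hypothetical IoT document: the second reading carries a field the first lacks.
raw = """
{
  "device": {"id": "t-17", "site": "plant-a"},
  "readings": [
    {"ts": 1, "temp": 20.5},
    {"ts": 2, "temp": 21.0, "humidity": 40}
  ]
}
"""

rows = to_rows(json.loads(raw), "readings")

# The relational schema has to be the union of every column seen so far.
columns = sorted({col for row in rows for col in row})

for row in rows:
    # Missing values become None (NULL) once forced into the fixed schema.
    print([row.get(col) for col in columns])
```

Even in this toy case, the second reading introduces a `humidity` field that the first does not have, so the target table needs a new column and the earlier row gets a NULL. Multiply that by every new field and every new report format, and the maintenance burden becomes clear.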
As expected, these ad-hoc approaches slow down the analytics process, require significant maintenance and recoding work (particularly as data scientists request new analytic report formats) and generally provide poor results. For data-driven enterprises, increasing the organization’s ability to quickly access data and gain insight from it is vital to future success. But the one-size-fits-all approach to data analysis is a serious roadblock, frustrating IT teams and data scientists alike.
Until IT teams and data scientists realize their legacy BI tools aren’t prepared to handle the realities of today’s big data analytics projects and their non-relational data, they’ll continue to burn time and money wrestling to get a square peg (non-relational data) into a round hole (relational databases). Wouldn’t that time and expertise be better spent determining what the data is trying to say?