Moving from data swamp to smart data
(Editor's note: William Trout will speak on the topic "From data swamp to smart data" at the MDM & Data Governance Summit in Chicago, July 11-13).
Over the last decade, the increased use of unstructured and alternative data, and the ascendance of the cloud, have posed an overt challenge to the relational database model based on structured inputs and batch transmission. The transition from the classic SQL-on-Hadoop database toward the unstructured, real-time data hub or lake has come with costs, however.
These costs are best understood within the context of a four-stage data lifecycle comprising capture, transformation, extraction and delivery. Challenges related to capture and transformation revolve around inconsistent data quality and manual intervention requirements. Extraction and delivery processes, on the other hand, tend to founder on the expectations of the end user or business unit.
End user dissatisfaction with existing business intelligence processes and the need for more timely delivery of pricing, operational and analytical insight have spurred a generalized reliance on workarounds, which may include the unauthorized input and removal of data. The knock-on effects of these workarounds have in turn fostered a desire for an end-to-end, all-embracing data solution.
Breaking Down the Silos
The process of eliminating data marts (and the superstructure repositories that contain them) presupposes massive resource deployment as well as a philosophical about-face. Resource deployment extends beyond data cleanup and the transition from legacy architecture to the investment required to build and support the data lake structure. The philosophical about-face relates to the fundamental purpose of data within an organization.
In this new world, data management is not a top-down process designed for a single use case. Rather, improved data management should enable the creation of a so-called golden source, in which universally accepted and accessible data is summoned by business partners according to their needs.
Look Before You Leap
The data lake that is able to store and process data in its native or raw format represents the latest iteration of this vision, but not the end state. Elimination of the ETL process as part of the transition to a unitary, schema-on-read data lake supports greater data volume but means the absence of any hygiene enforcement strategy.
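The trade-off in schema-on-read can be made concrete with a minimal sketch. The records, field names and the read_with_schema helper below are all hypothetical, chosen only for illustration. Raw lines are stored exactly as they arrive, and problems surface only when a reader applies a schema.

```python
import json

# Hypothetical raw records, landed in the lake exactly as they arrived
# (no ETL step has validated or reshaped them).
RAW_LINES = [
    '{"ticker": "ABC", "price": "101.5"}',
    '{"ticker": "XYZ", "price": null}',
    '{"symbol": "QRS", "price": "n/a"}',
]


def read_with_schema(lines):
    """Apply a schema at read time and set aside records that do not fit."""
    good, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            good.append(
                {"ticker": str(record["ticker"]), "price": float(record["price"])}
            )
        except (KeyError, TypeError, ValueError):
            # Missing field, null value or unparseable number.
            rejected.append(record)
    return good, rejected


good, rejected = read_with_schema(RAW_LINES)
print(len(good), "usable;", len(rejected), "rejected at read time")
```

Nothing stopped the malformed records from being stored. Every consumer has to rediscover and handle them, which is the hygiene gap in practice.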
Flexibility around data may be offset by the difficulty of matching it within the logical data model. Moreover, the ongoing requirements of maintaining the logical data model and semantic layer can be immense.
Ultimately, in many organizations, what purports to be a data lake will in fact be a witch’s brew of unstructured and structured data models, in essence not that different from the centralized data repository it was meant to replace. Firms have found that data linkage and lineage problems tend to become more intractable in the data lake environment, all the more so in a mobile, social and IoT-inflected (think sensor data) world in which structured, unstructured and semi-structured data sets converge.
Enter the Knowledge Graph
Recently, an alternative to the relational database and the Hadoop-based data lake has emerged. The appeal of the knowledge graph as launched by firms like Thomson Reuters and Amazon (“Neptune”) lies in its simplicity and flexibility of deployment.
In particular, the knowledge graph uses semantic attribution to make sense of information rather than hang data on a common identifier. As such, it serves as a catalyst for natural language processing (NLP), natural language generation (NLG) and other machine learning and AI-based tools.
The knowledge graph is best understood in relation to Internet search, in that it is able to “tag” text-based or unstructured information and classify it or assign it relationships. The ability to categorize information hierarchically (e.g., Skidmore, Owings and Merrill designed the Sears Tower, which is a building, which is in Chicago, which is a city) and in plain English facilitates navigation and gives the data its meaning.
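The Sears Tower example can be written as a handful of subject-predicate-object statements. This is a minimal sketch in plain Python rather than any vendor's graph API. The predicate names (designed, is_a, located_in) and the describe helper are assumptions made for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
# Predicate names are hypothetical labels, not a standard vocabulary.
TRIPLES = [
    ("Skidmore, Owings and Merrill", "designed", "Sears Tower"),
    ("Sears Tower", "is_a", "building"),
    ("Sears Tower", "located_in", "Chicago"),
    ("Chicago", "is_a", "city"),
]


def describe(entity, triples):
    """Return plain-English statements for every fact about an entity."""
    return [
        f"{subj} {pred.replace('_', ' ')} {obj}"
        for subj, pred, obj in triples
        if subj == entity
    ]


facts = describe("Sears Tower", TRIPLES)
for fact in facts:
    print(fact)
```

Because the relationship is stored alongside the data, a new kind of fact is one more triple rather than a schema change.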
The benefits of the knowledge graph center on ease of use and expandability. The business user is able to visualize data and perform analytics for lead generation without having to write SQL queries or use joins to connect scattered or buried data.
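To illustrate what connecting data without joins can look like, the sketch below answers "in which cities are the buildings this firm designed?" by walking relationships from a starting entity. It uses plain Python. The edge labels and the build_index and follow helpers are hypothetical and do not represent any particular graph product's query language.

```python
from collections import defaultdict

# Hypothetical edges, each a (subject, predicate, object) triple.
EDGES = [
    ("Skidmore, Owings and Merrill", "designed", "Sears Tower"),
    ("Sears Tower", "is_a", "building"),
    ("Sears Tower", "located_in", "Chicago"),
    ("Chicago", "is_a", "city"),
]


def build_index(edges):
    """Index edges by (subject, predicate) so each hop is a direct lookup."""
    index = defaultdict(list)
    for subj, pred, obj in edges:
        index[(subj, pred)].append(obj)
    return index


def follow(index, start, predicates):
    """Walk a chain of predicates outward from a starting entity."""
    frontier = [start]
    for pred in predicates:
        frontier = [obj for node in frontier for obj in index.get((node, pred), [])]
    return frontier


index = build_index(EDGES)
cities = follow(index, "Skidmore, Owings and Merrill", ["designed", "located_in"])
print(cities)
```

In a relational model the same question would typically involve joining a firms table to a buildings table to a cities table. Here the path itself is the query.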
At the same time, graph technology turns the golden source concept on its head by affording the end user multiple perspectives on the truth. The extent to which machine learning technologies are baked into graph technology is a lagniappe.