Why Does Data Lineage Matter?
While there are many automated approaches to discovering the location of sensitive data in emails and files, whether at rest or in motion, far less is available for structured data as it rests in and moves between databases. Discovering where sensitive data is hidden inside structured sources such as databases has proven to be a major problem for most companies, largely because most of them have no comprehensive data map that shows their data lineage. As a result, to pass an audit or meet regulatory requirements, companies are forced to document their data lineage in detail for the first time.
Unfortunately, while there are tools to show data lineage once you know it, the difficulty lies in gathering the information needed to populate these tools in the first place. The most popular approach is simply to hire an army of system integrators or data analysts to document the environment by hand. This is an extremely expensive, slow and error-prone approach: by the time these teams have finished documenting an enterprise's data relationships, the company's processes, data and data lineage have already changed. The result is that mapping data frequently becomes the reason data lineage projects are severely delayed or even cancelled. Often, these projects are simply too expensive for IT to justify to the business.
Lots of Technology Can Get You Part of the Way
So what are C-level executives supposed to do? Ignoring the problem is not an option. Someday government regulators, the company's board, or an attorney will be at the door with a court order or a subpoena demanding a full and accurate map, or a portion of one, of your organization's data.
As mentioned above, there are some very good solutions for discovering and managing so-called "unstructured" data like emails and files. However, when it comes to structured data, the most popular approaches fall short.
For example, there are a number of metadata repository products on the market that help display lineage for structured data. The bad news is that these products first have to be populated with metadata that describes the business rules and transformations between systems. To solve this problem, metadata repository vendors have created links to commercial data integration tools so that the relationship metadata can be pulled from them. And while traditional data integration tools, such as ETL, EAI, data cleansing and data profiling, contain some of the necessary information, they tend to account for a relatively small percentage of the actual data movement that occurs in most companies. Because most companies still move data through custom application code, SQL queries or scripting languages, over 80 percent of the data lineage metadata isn't available in the commercial data integration tools.
The result is that the critical problem that must be solved to successfully complete a data lineage project is discovery! Traditional data integration tools do nothing to help discover where data is located, how it moves, or the business rules, complex transformations and data relationships that describe how it is transformed along the way. Companies are therefore still forced into the time-consuming and error-prone process of analyzing data by hand.
This is exactly where most data lineage projects fall apart. If you only have the lineage for 10 to 20 percent of your structured data, you have a solution that doesn't provide visibility into the vast majority of critical systems in the company. Management sees this kind of project as almost useless, and the cost of bringing in an army of data analysts to document everything else is prohibitive.
Where Do You Turn?
In an ideal world, the way to solve this problem would be to automate the process of discovering the data lineage between systems. Rather than dance around the issue, you could analyze the data in multiple structured data sources, run it through a sophisticated comparison engine that discovers the business rules and complex transformations between those sources, and then automatically generate XML that can be loaded into any metadata repository or data lineage tool.
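The last step of that pipeline, turning discovered mappings into XML, can be sketched in a few lines of Python. This is only an illustration: the element names (lineage, mapping, source, target, rule) and the example mappings are assumptions made up for this sketch, not the import schema of any particular metadata repository or lineage tool.

```python
import xml.etree.ElementTree as ET


def mappings_to_xml(mappings):
    """Serialize discovered column mappings as a simple XML document.

    The element names are hypothetical; a real repository would
    define its own import schema.
    """
    root = ET.Element("lineage")
    for m in mappings:
        el = ET.SubElement(root, "mapping")
        ET.SubElement(el, "source").text = m["source"]
        ET.SubElement(el, "target").text = m["target"]
        ET.SubElement(el, "rule").text = m["rule"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical mappings, as a discovery engine might report them.
EXAMPLE_MAPPINGS = [
    {
        "source": "crm.customer.first, crm.customer.last",
        "target": "dw.dim_customer.full_name",
        "rule": "first || ' ' || last",
    },
    {
        "source": "orders.line.qty, orders.line.price",
        "target": "dw.fact_sales.total",
        "rule": "qty * price",
    },
]

lineage_xml = mappings_to_xml(EXAMPLE_MAPPINGS)
print(lineage_xml)
```

In practice the output would be shaped to whatever format the receiving repository accepts; the point is only that once the rules are known, producing the metadata is mechanical.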
Fortunately, there is a new class of software that helps address and automate this process. Automated "data relationship discovery" software now exists that crawls through multiple datasets simultaneously, automatically finds relationships and complex transformations based on the implicit business rules hidden in the data, organizes them for human analysis and finds the exceptions that exist between datasets, allowing for accurate and trustworthy data lineage information. Think of data relationship discovery as a giant leap past data profiling, where the output is not just a bunch of statistics but actionable code that describes how two data sources relate to each other, including complex business rules built from substrings, concatenations, arithmetic equations and even sophisticated case statements.
Using this new approach, companies are able to complete data lineage discovery and documentation in a fraction of the time required with traditional tools. In fact, in actual deployments, automated data relationship discovery has been five times faster than profiling and manual discovery. The result is more accurate data lineage information in less time and at less cost, making data lineage projects fast, feasible, affordable and attractive for IT to present to the business.
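To make the idea concrete, here is a deliberately tiny Python sketch of rule discovery by brute force. It is not how any commercial product works; it assumes two small, row-aligned datasets, invented column names, and a handful of candidate rule shapes (direct copy, substring, concatenation, addition and multiplication), and it leaves out case statements entirely. For each target column it keeps the candidate that reproduces the most rows and reports the rows that do not fit as exceptions.

```python
from itertools import combinations, permutations


def candidate_rules(columns, sample_row):
    """Yield (description, function) pairs for simple rule shapes."""
    is_str = {c: isinstance(sample_row[c], str) for c in columns}
    for c in columns:
        yield c, lambda r, c=c: r[c]  # direct copy
        if is_str[c]:
            for n in range(1, 5):  # leading substrings
                yield f"SUBSTR({c}, 1, {n})", lambda r, c=c, n=n: r[c][:n]
    for a, b in permutations(columns, 2):  # concatenation, order matters
        if is_str[a] and is_str[b]:
            yield f"{a} || ' ' || {b}", lambda r, a=a, b=b: f"{r[a]} {r[b]}"
    for a, b in combinations(columns, 2):  # arithmetic on numeric pairs
        if not is_str[a] and not is_str[b]:
            yield f"{a} + {b}", lambda r, a=a, b=b: r[a] + r[b]
            yield f"{a} * {b}", lambda r, a=a, b=b: r[a] * r[b]


def discover(source_rows, target_rows, min_match=0.8):
    """Find, per target column, the rule that reproduces the most rows."""
    columns = list(source_rows[0])
    rules = list(candidate_rules(columns, source_rows[0]))
    found = {}
    for tcol in target_rows[0]:
        best = None
        for desc, fn in rules:
            misses = [i for i, (s, t) in enumerate(zip(source_rows, target_rows))
                      if fn(s) != t[tcol]]
            rate = 1 - len(misses) / len(source_rows)
            if best is None or rate > best["match_rate"]:
                best = {"rule": desc, "match_rate": rate, "exceptions": misses}
        if best["match_rate"] >= min_match:
            found[tcol] = best
    return found


# Invented sample data; the last target row deliberately breaks the rule.
source = [
    {"first": "Ada", "last": "Lovelace", "qty": 2, "price": 5},
    {"first": "Alan", "last": "Turing", "qty": 3, "price": 4},
    {"first": "Grace", "last": "Hopper", "qty": 1, "price": 9},
    {"first": "Edgar", "last": "Codd", "qty": 4, "price": 2},
    {"first": "Jim", "last": "Gray", "qty": 2, "price": 7},
]
target = [
    {"full_name": "Ada Lovelace", "initial": "A", "total": 10},
    {"full_name": "Alan Turing", "initial": "A", "total": 12},
    {"full_name": "Grace Hopper", "initial": "G", "total": 9},
    {"full_name": "Edgar Codd", "initial": "E", "total": 8},
    {"full_name": "Jim Gray", "initial": "J", "total": 15},
]

result = discover(source, target)
for column, info in result.items():
    print(column, "<-", info["rule"], "exceptions at rows", info["exceptions"])
```

On this sample the sketch recovers a concatenation, a substring and a multiplication, and flags the one row whose total does not match. Real datasets are not row-aligned and have far larger search spaces, which is where the actual engineering effort in such tools goes.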












