Why Does Data Lineage Matter?

The term data lineage might bring to mind family trees and dry legal searches. However, for companies concerned about critical and sensitive data, the topic has become increasingly important. Today's environment of corporate regulation and stringent data governance is intended to reduce the many risks associated with managing data, such as security breaches, privacy violations, and the intentional or accidental exposure of sensitive data. Managing these risks is necessary and extremely useful, but there is one gaping hole in the promise of risk mitigation: to achieve compliance, whether external or internal, companies must be able to demonstrate where data comes from, where it flows to, and how it is transformed as it travels through the enterprise. After all, you can't manage what you can't find. This is what data lineage is all about: documenting where data is and how it flows, so you can manage and secure it appropriately as it moves across the corporate network.

While there are many automated approaches to discovering sensitive data in emails and files, whether at rest or in motion, far less is available for structured data as it rests in and moves between databases. Discovering the location of sensitive data hidden inside structured sources has proven to be a major problem for most companies, because most do not have a comprehensive data map that shows their data lineage. As a result, to pass an audit or meet regulatory requirements, companies are forced to document their data lineage in detail for the first time. Unfortunately, while there are tools to display data lineage once you know it, the difficulty lies in gathering the information to populate these tools in the first place. The most popular approach is simply to hire an army of system integrators or data analysts to document the environment by hand. This approach is extremely expensive, slow and error-prone: by the time these teams have finished documenting an enterprise's data relationships, the company's processes, data and data lineage have already changed. Consequently, data mapping is frequently the reason that data lineage projects are severely delayed or even cancelled. Often these projects are simply too expensive for IT to justify to the business.

Lots of Technology Can Get You Part of the Way

So what are C-level executives supposed to do? Ignoring the problem is not an option. Someday government regulators, the company's board, or an attorney will be at the door with a court order or a subpoena demanding a full and accurate map, or at least a portion of one, of your organization's data.

As mentioned above, there are some very good solutions for discovering and managing so-called "unstructured" data like emails and files. However, when it comes to structured data, the most popular approaches fall short.

For example, there are a number of metadata repository products on the market that help display lineage for structured data. The bad news is that these products first have to be populated with metadata that describes the business rules and transformations between systems. To address this, metadata repository vendors have built links to commercial data integration tools so that the relationship metadata can be pulled from them. And while traditional data integration tools, such as ETL, EAI, data cleansing and data profiling, contain some of the necessary information, they tend to account for a relatively small percentage of the actual data movement that occurs in most companies. Because most companies still move data through custom application code, SQL queries or scripting languages, over 80 percent of the data lineage metadata isn't available in the commercial data integration tools.

The result is that the critical problem that must be solved to complete a data lineage project successfully is discovery. Because traditional data integration tools do nothing to help discover the data locations, business rules, complex transformations and data relationships that describe how data moves and how it is transformed along the way, companies are still forced into the time-consuming and error-prone process of analyzing data by hand.

This is exactly where most data lineage projects fall apart. If you only have the lineage for 10 to 20 percent of your structured data, you have a solution that doesn't provide visibility into the vast majority of critical systems in the company. Management regards this kind of project as almost useless, and the cost of bringing in an army of data analysts to document everything is prohibitive.

Where Do You Turn?

In an ideal world, the way to solve this problem would be to automate the process of discovering the data lineage between systems. Rather than dancing around the issue, you could analyze the data in multiple structured data sources, run it through a sophisticated comparison engine that discovers the business rules and complex transformations between those sources, and then automatically generate XML that can be loaded into any metadata repository or data lineage tool.
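To make the last step of that pipeline concrete, here is a minimal Python sketch of turning discovered relationships into XML. The element names (lineage, mapping, source, target, rule) and the sample system and column names are assumptions made up for illustration; a real metadata repository will define its own import schema.

```python
import xml.etree.ElementTree as ET


def lineage_to_xml(mappings):
    """Serialize a list of discovered mappings into an XML string.

    Each mapping is a dict with "source", "target" and "rule" keys.
    The element names are hypothetical, not a vendor schema.
    """
    root = ET.Element("lineage")
    for m in mappings:
        node = ET.SubElement(root, "mapping")
        ET.SubElement(node, "source").text = m["source"]
        ET.SubElement(node, "target").text = m["target"]
        ET.SubElement(node, "rule").text = m["rule"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical output of a discovery step: two column-level relationships.
mappings = [
    {"source": "crm.customer.phone",
     "target": "dw.customer.area_code",
     "rule": "substring(phone, 1, 3)"},
    {"source": "crm.customer.first, crm.customer.last",
     "target": "dw.customer.full_name",
     "rule": "concat(first, ' ', last)"},
]

xml_text = lineage_to_xml(mappings)
print(xml_text)
```

The point of the sketch is only that, once relationships have been discovered, exporting them is the easy part; the hard part is the discovery that produces the list in the first place.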

Fortunately, there is a new class of software that helps automate this process. Automated "data relationship discovery" software crawls through multiple datasets simultaneously, finds relationships and complex transformations based on the implicit business rules hidden in the data, organizes them for human analysis, and flags the exceptions that exist between datasets, yielding accurate and trustworthy data lineage information. Think of data relationship discovery as a giant leap past data profiling: the output is not just a bunch of statistics, but actionable code that describes how two data sources relate to each other, including complex business rules that involve substrings, concatenations, arithmetic equations and even sophisticated case statements.
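As a rough illustration of the idea, and not a description of how any commercial product works, the following Python sketch compares two small datasets row by row and reports which simple rule (a direct copy, a three-character prefix, or a space-separated concatenation) reproduces each target column from the source columns. The function name, the shared key column and the sample records are all assumptions chosen for the example, and all values are assumed to be strings.

```python
from itertools import permutations


def discover_rules(source_rows, target_rows, key):
    """Find, for each target column, a simple rule over source columns
    that reproduces its value in every row matched on `key`."""
    target_by_key = {row[key]: row for row in target_rows}
    pairs = [(s, target_by_key[s[key]])
             for s in source_rows if s[key] in target_by_key]
    if not pairs:
        return {}

    src_cols = [c for c in source_rows[0] if c != key]
    tgt_cols = [c for c in target_rows[0] if c != key]

    # Candidate transformations: (description, function of a source row).
    candidates = []
    for s in src_cols:
        candidates.append((s, lambda row, s=s: row[s]))
        candidates.append((f"{s}[:3]", lambda row, s=s: row[s][:3]))
    for a, b in permutations(src_cols, 2):
        candidates.append((f"{a} + ' ' + {b}",
                           lambda row, a=a, b=b: row[a] + " " + row[b]))

    rules = {}
    for t in tgt_cols:
        for description, fn in candidates:
            # Keep the first rule that holds for every matched row.
            if all(fn(src) == tgt[t] for src, tgt in pairs):
                rules[t] = description
                break
    return rules


# Made-up sample data for two systems that share an "id" column.
source = [
    {"id": "1", "first": "Ada", "last": "Lovelace", "phone": "415-555-0100"},
    {"id": "2", "first": "Grace", "last": "Hopper", "phone": "212-555-0199"},
]
target = [
    {"id": "1", "full_name": "Ada Lovelace", "area_code": "415"},
    {"id": "2", "full_name": "Grace Hopper", "area_code": "212"},
]

rules = discover_rules(source, target, key="id")
print(rules)
```

A real discovery engine would also have to find the join keys itself, search a far larger space of transformations (arithmetic, case statements), and report the rows where a rule almost holds as exceptions; the sketch skips all of that and accepts only rules that hold for every matched row.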

Using this new approach, companies are able to complete data lineage discovery and documentation in a fraction of the time required with traditional tools. In fact, in actual deployments, automated data relationship discovery has been five times faster than profiling and manual discovery. The result is more accurate data lineage information in less time and at less cost, making data lineage projects fast, feasible, affordable and attractive for IT to present to the business.
