Data Migration Strategies

  • June 01 1999, 1:00am EDT

At any given time, according to industry analyst estimates, roughly two-thirds of the Fortune 1000/Global 2000 are engaged in some form of data conversion project, including migration from legacy systems to packaged applications, data consolidations, data quality improvements and the creation of data warehouses and data marts. These projects are driven by internal financial pressures, increasing competition, ongoing deregulation, industry consolidation due to mergers and acquisitions, and Year 2000 and European currency concerns. According to recent studies conducted by The Standish Group, a research advisory firm based in Dennis, Massachusetts, 15,000 of these types of projects with budgets of $3 million or greater will begin in 1999, at a total cost of $95 billion. AMR Research estimates that ERP conversion projects alone will generate $52 billion by 2002, and Gartner Group/Dataquest forecasts that $2.6 billion will be spent on data warehousing software in 1999, growing to $6.9 billion by 2002.

Project Failure as Norm?

"Not only time, but company credibility, is regarded as a corporate asset. Running the risk of project default and/or cancellation is all too commonplace, and extensive investment in refining insufficient data facilities is costly and counterproductive. This approach alone can jeopardize a company's timetable and competitiveness."

Jim Johnson, Chairman,
The Standish Group

Unfortunately, most data migration projects don't go as smoothly as anticipated. According to The Standish Group, in 1998, 74 percent of all IT projects either overran or failed, resulting in almost $100 billion in unexpected costs. Of the 15,000 data migration projects that are starting in 1999, as many as 88 percent will either overrun or fail.

One of the primary reasons for this extraordinary failure rate is the lack of a thorough understanding of the source data early in these projects. Conventional approaches to data profiling and migration can create nearly as many problems as they resolve: data that does not load properly, poor-quality data and compounded inaccuracies, time and cost overruns and, in extreme cases, late-stage project cancellations. The old adage "garbage in, garbage out" is the watchword here.

Understanding Your Data: The First Essential Step

Before undertaking a large-scale migration from a legacy system to a new application, data analysts need to learn as much as possible about the data they plan to move. If, for example, you have a variety of pegs that you'd like to fit into round holes, it's best to find out which ones are already round and which ones are square, triangular or rectangular. Considering the magnitude of an organization's data, the data analyst's task in obtaining this knowledge is overwhelming, to say the least. IT organizations can begin their data analysis by implementing a two-step process: data profiling and data mapping. Data profiling involves studying the source data thoroughly to understand its content, structure, quality and integrity. Once the data has been profiled, an accurate set of mapping specifications can be developed based on this profile, a process called data mapping. Together, data profiling and mapping comprise the essential first step in any successful data migration project and should be completed before attempting to extract, scrub, transform and load the data into the target database.

Conventional Techniques in Data Migration: Problems and Pitfalls

The conventional approach to data profiling and mapping starts with a large team of people: data and business analysts, data administrators, database administrators, system designers, subject matter experts and others. These people meet in a series of joint application development (JAD) sessions and attempt to extract useful information about the content and structure of the legacy data sources by examining outdated documentation, COBOL copybooks, inaccurate meta data and, in some cases, the physical data itself. Typically, this is a very labor-intensive process, supplemented in some cases by semi-automated query techniques. Profiling legacy data in this way is extremely complex, time-intensive and error-prone. Once the process is complete, only a limited understanding of the source data is achieved.

At that point, according to the project flow chart, the data analyst moves on to the mapping phase. However, since the source data is so poorly understood and inferences about it are largely based on assumptions rather than facts, this phase typically results in an inaccurate data model and set of mapping specifications. Based on this information, the data is extracted, scrubbed, transformed and loaded into the new database.

Not surprisingly, in almost all cases, the new system doesn't work correctly the first time. Then the rework process begins: redesigning, recoding, reloading and retesting. At best, the project incurs significant time and cost overruns. At worst, faced with runaway costs and no clear end in sight, senior management cancels the project, preferring to live with an inefficient but partially functional information system rather than incur the ongoing costs of an "endless" data migration project.

Strategies for Data Migration: 6 Steps to Preparing Your Data

Data profiling and mapping consist of six sequential steps, three for data profiling and three for data mapping, with each step building on the information produced in the previous steps. The resulting transformation maps, in turn, can be used in conjunction with third-party data migration tools to extract, scrub, transform and load the data from the old system to the new system.

Data sources are profiled in three dimensions: down columns (column profiling); across rows (dependency profiling); and across tables (redundancy profiling).

Column Profiling. Column profiling analyzes the values in each column or field of source data, inferring detailed characteristics for each column, including data type and size, range of values, frequency and distribution of values, cardinality and null and uniqueness characteristics. This step allows analysts to detect and analyze data content quality problems and evaluate discrepancies between the inferred, true meta data and the documented meta data.
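
As an illustration, the kind of inference column profiling performs can be sketched in a few lines of Python. The row format, column names and the particular characteristics inferred below are invented for the example, not taken from any specific profiling tool.

```python
# Minimal column-profiling sketch: infer basic characteristics of one
# column from a list of row dicts (e.g., rows read from a legacy extract).
from collections import Counter

def profile_column(rows, column):
    """Infer size, nulls, cardinality, uniqueness and type hints for a column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),            # missing values
        "cardinality": len(freq),                        # distinct values
        "unique": len(freq) == len(non_null),            # candidate key?
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "all_numeric": all(str(v).lstrip("-").isdigit() for v in non_null),
        "most_common": freq.most_common(3),              # value distribution
    }

rows = [
    {"cust_id": "100", "state": "MA"},
    {"cust_id": "101", "state": "ma"},    # inconsistent case: a quality flag
    {"cust_id": "102", "state": None},    # unexpected null
]
print(profile_column(rows, "state"))
```

Even this toy profile surfaces the discrepancies the article describes: a column documented as "2-character state code, required" turns out to contain mixed case and nulls.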

Dependency Profiling. Dependency profiling analyzes data across rows, comparing values in every column with values in every other column, and infers all dependency relationships that exist between attributes within each table. This process cannot be accomplished manually. Dependency profiling identifies primary keys and whether or not expected dependencies (e.g., those imposed by a new application) are supported by the data. It also identifies "gray-area dependencies": those that are true most of the time, but not all of the time, and are usually an indication of a data quality problem.
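
To make the idea concrete, here is a hedged sketch of checking one functional dependency in Python. A dependency A -> B holds when each value of A maps to exactly one value of B; a "gray-area" dependency holds for most rows but not all. The column names and data are hypothetical.

```python
# Fraction of rows consistent with the functional dependency lhs -> rhs.
# 1.0 means the dependency holds exactly; values just below 1.0 suggest
# a gray-area dependency, i.e., a likely data quality problem.
from collections import Counter, defaultdict

def dependency_strength(rows, lhs, rhs):
    groups = defaultdict(Counter)      # lhs value -> counts of rhs values
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # For each lhs value, the majority rhs value is treated as "correct."
    consistent = sum(c.most_common(1)[0][1] for c in groups.values())
    return consistent / len(rows) if rows else 1.0

rows = [
    {"zip": "02638", "city": "Dennis"},
    {"zip": "02638", "city": "Dennis"},
    {"zip": "02638", "city": "Denis"},      # one misspelled row
    {"zip": "01701", "city": "Framingham"},
]
print(dependency_strength(rows, "zip", "city"))   # 0.75: true most of the time
```

A real profiling tool would test every column pair this way (and compound keys besides), which is exactly why the article says the process cannot be done manually.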

Redundancy Profiling. Redundancy profiling compares data between tables of the same or different data sources, determining which columns contain overlapping or identical sets of values. It looks for repeating patterns among an organization's "islands of information": billing systems, sales force automation systems, post-sales support systems and so on. Redundancy profiling identifies attributes that contain the same information but have different names (synonyms) and attributes that have the same name but different business meanings (homonyms). It also helps determine which columns are redundant and can be eliminated and which are necessary to connect information between tables. Eliminating redundant columns reduces processing overhead and the probability of error in the target database. As with dependency profiling, this process cannot be accomplished manually.
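
One simple way to detect such overlap, sketched here under invented table and column names, is to compare the distinct value sets of two columns from different systems and measure their Jaccard overlap:

```python
# Redundancy-profiling sketch: how much do the distinct value sets of two
# columns (possibly from different data sources) overlap?
def overlap(values_a, values_b):
    """Jaccard overlap of two columns' distinct values, 0.0 to 1.0."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

billing_cust = ["100", "101", "102"]   # billing system: CUST_NO
support_acct = ["101", "102", "103"]   # support system: ACCT_ID (synonym?)
print(overlap(billing_cust, support_acct))   # 0.5: strong synonym candidate
```

A high overlap between differently named columns flags a synonym candidate for the analyst to confirm; the same comparison run across all column pairs is what makes automation essential.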

Figure 1: Key steps in data profiling and mapping

Once the data profiling process is finished, the profile results can be used to complete the remaining three data mapping steps of a migration project: normalization, model enhancement and transformation mapping.

Normalization. This step builds a fully normalized relational model that is based on, and fully supported by, the consolidated results of profiling all the data. Because the model is derived from the actual data rather than from assumptions about it, it will not fail when the data is loaded.
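
As a toy illustration of what normalization does with profiling results, suppose dependency profiling has confirmed that zip determines city. The repeated city values can then be split into their own table keyed by zip. The table and column names are hypothetical.

```python
# Normalization sketch: split out the attribute that dependency profiling
# showed to be functionally determined by zip (zip -> city).
orders = [
    {"order_id": 1, "zip": "02638", "city": "Dennis"},
    {"order_id": 2, "zip": "02638", "city": "Dennis"},
    {"order_id": 3, "zip": "01701", "city": "Framingham"},
]

# New lookup table: one row per zip, eliminating the repeated city values.
zips = {row["zip"]: row["city"] for row in orders}
# Orders now carry only the foreign key.
orders_nf = [{"order_id": r["order_id"], "zip": r["zip"]} for r in orders]
print(zips)
print(orders_nf)
```

The decomposition is safe precisely because the dependency was verified against the data rather than assumed from documentation.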

Model Enhancement. This process involves modifying the normalized model by adding structures to support new requirements or by adding indexes and denormalizing the structures to enhance performance.

Transformation Mapping. Once the data model modifications are complete, a set of transformation maps can be created to show the relationships between columns in the source files and tables in the enhanced model, including attribute-to-attribute flows. Ideally, these transformation maps facilitate the capture of scrubbing and transformation requirements and provide essential information to the programmers creating conversion routines to move data from the source to the target database.
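
In practice a transformation map can be as simple as a table pairing each target attribute with its source column and scrub rule, so conversion programmers have one authoritative place to look. The sketch below uses invented source and target names to show the shape of such a map.

```python
# Transformation-map sketch: target attribute -> source column + transform.
# All file, column and attribute names are hypothetical.
transformation_map = {
    "customer.cust_id": {"source": "LEGACY.CUSTFILE.CUST-NO",
                         "transform": lambda v: int(v)},            # retype
    "customer.state":   {"source": "LEGACY.CUSTFILE.ST",
                         "transform": lambda v: v.strip().upper()}, # scrub
}

def apply_map(source_row):
    """Produce one target row by applying every mapped transform."""
    return {target: spec["transform"](source_row[spec["source"]])
            for target, spec in transformation_map.items()}

row = {"LEGACY.CUSTFILE.CUST-NO": "100", "LEGACY.CUSTFILE.ST": " ma "}
print(apply_map(row))   # {'customer.cust_id': 100, 'customer.state': 'MA'}
```

Capturing the scrub and transform rules alongside the column mapping, as here, is what lets the map feed directly into the extract, scrub, transform and load stages.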

Developing an accurate profile of existing data sources is the essential first step in any successful data migration project. By executing a sound data profiling and mapping strategy, small, focused teams of technical and business users can quickly perform the highly complex tasks necessary to achieve a thorough understanding of source data, a level of understanding that simply cannot be achieved through conventional processes and semi-automated query techniques.

Better Decisions

By following these six steps in data profiling and mapping, companies can take their data migration projects to completion successfully the first time around, eliminating extensive design rework and late-stage project cancellations. A good data profiling and mapping methodology will take into account the entire scope of a project, even warning IT management if the business objectives of the project are not supported by the data. Data profiling and mapping, if done correctly, can dramatically lower project risk, enabling valuable resources to be redirected to other, more fruitful projects. Finally, it will deliver higher data and application quality, resulting in more informed business decisions, which typically translate into greater revenues and profits.
