This is an article from the August 2006 issue of DM Review's Extended Edition.
InsightAmerica provides secure, reliable access to hundreds of national and state-specific databases, including identity data, property records, addresses, phone numbers, and motor vehicle and criminal records, in both online and batch mode. Companies and government organizations turn to the company for the most current information on criminal records, consumer contact data and other publicly available information. Without consistent, accurate and reliable data, the quality of the company's primary product is compromised.
InsightAmerica offers a suite of online, batch and XML API products to deliver information to our client accounts. The information provided by InsightAmerica arrives from a variety of government and private-sector organizations. For example, each state supplies a different file for criminal arrest records, each of which must be integrated into InsightAmerica products every month. Complicating the process, approximately 10 to 20 percent of these sources change layouts or formats between successive files. Data sources typically exceed 350 million records - a figure that only grows with time.
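A layout change in a monthly file is easiest to catch before loading begins. As a rough illustration of that check - assuming the state files are CSVs with header rows, and using hypothetical column names - a sketch in Python might compare this month's header against last month's:

```python
# Illustrative layout-drift check: compare the header row of the
# current month's file against the previous month's before loading.
# The file contents and column names below are hypothetical.
import csv
import io

def detect_layout_change(prev_csv, curr_csv):
    """Return (added, removed) column-name sets between two CSV texts."""
    prev_cols = next(csv.reader(io.StringIO(prev_csv)))
    curr_cols = next(csv.reader(io.StringIO(curr_csv)))
    return set(curr_cols) - set(prev_cols), set(prev_cols) - set(curr_cols)

prev = "name,dob,offense,disposition\nJOHN SMITH,1970-01-01,THEFT,GUILTY\n"
curr = "name,birth_date,offense,disposition,county\nJOHN SMITH,1970-01-01,THEFT,GUILTY,DENVER\n"

added, removed = detect_layout_change(prev, curr)
print(sorted(added), sorted(removed))  # ['birth_date', 'county'] ['dob']
```

A nonempty result would route the file to a human or to an updated load mapping instead of the standard pipeline.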
The company's eight-person data management team needed a method for uncovering and addressing these data integration challenges on an ongoing basis. Duplicate records were scattered throughout our data stores, and it was difficult to apply standards as data entered the system.
The company decided to implement dfPower Studio, a data quality integration solution from DataFlux. dfPower Studio provides a design environment that allows data professionals at InsightAmerica to analyze incoming data as well as build and apply business rules to fix that data.
dfPower Studio helps the data management team identify duplicate records in incoming data sources based on a weighted relevance scale. The technology uses its core matching engine during the profiling phase to find relationships between records, and InsightAmerica can tune how closely records must agree before the engine declares a match.
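DataFlux does not publish its matching internals, but the general technique - score field-by-field similarity, weight the scores, and flag pairs above a tunable threshold - can be sketched in a few lines of Python. The field weights and threshold below are assumptions for illustration only:

```python
# Illustrative weighted record matching (not DataFlux's actual engine).
from difflib import SequenceMatcher

# Hypothetical field weights: the name carries more relevance than city.
WEIGHTS = {"name": 0.6, "address": 0.3, "city": 0.1}

def similarity(a, b):
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted relevance score across the configured fields."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def is_match(rec_a, rec_b, threshold=0.85):
    """Raise `threshold` for stricter matches, lower it for looser ones."""
    return match_score(rec_a, rec_b) >= threshold

a = {"name": "John Q. Smith", "address": "123 Main St", "city": "Denver"}
b = {"name": "John Smith", "address": "123 Main Street", "city": "Denver"}
print(is_match(a, b))  # True at the 0.85 threshold
```

The `threshold` parameter plays the role of the tunable sensitivity described above: lowering it surfaces more candidate duplicates at the cost of more false positives.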
Once the technology identifies redundant or duplicate data, InsightAmerica utilizes DataFlux's "surviving record" feature. This capability retains selected data points from existing records while updating or augmenting the information with new data as it arrives.
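The surviving-record step is configured inside dfPower Studio rather than coded by hand; as a minimal sketch of the underlying idea, assuming each record carries an update date, a survivor can be built by preferring the newest non-empty value for every field:

```python
# Illustrative survivorship sketch: merge a cluster of duplicates into
# one surviving record, keeping the newest non-empty value per field.
# Field names and dates are hypothetical.

def build_survivor(cluster):
    """cluster: list of dicts for one entity; each has an 'updated' ISO date."""
    # Newest records first, so their values win when a field conflicts.
    ordered = sorted(cluster, key=lambda r: r["updated"], reverse=True)
    survivor = {}
    for rec in ordered:
        for field, value in rec.items():
            # Keep the first (i.e., newest) non-empty value for each field.
            if field not in survivor and value not in ("", None):
                survivor[field] = value
    return survivor

dupes = [
    {"name": "John Smith", "phone": "", "updated": "2006-05-01"},
    {"name": "J. Smith", "phone": "303-555-0100", "updated": "2006-01-15"},
]
print(build_survivor(dupes))
# The new record supplies the name; the phone survives from the old one.
```

This mirrors the behavior described above: fresh data updates the record, while data points absent from the incoming feed are preserved from existing records.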
Finally, the company uses DataFlux to codify and enforce business standards for data elements. Once those elements are standardized, it is easier for the company to find qualified matches and meet the end goal of a consistent, accurate and reliable master data file.
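Standardization rules of this kind - casing, punctuation, abbreviation expansion - are the simplest to codify. A minimal Python sketch, with an abbreviation table far smaller than any production rule set, shows why standardization makes matching easier: differently formatted inputs collapse to the same canonical form.

```python
# Illustrative standardization rule: normalize addresses so records
# from different sources compare cleanly. The abbreviation table is
# a tiny assumed sample, not a real rule set.
import re

STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def standardize_address(raw):
    """Lowercase, strip punctuation, expand common abbreviations, title-case."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    expanded = [STREET_ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(expanded).title()

print(standardize_address("123 MAIN ST."))    # -> "123 Main Street"
print(standardize_address("123 Main Street")) # -> "123 Main Street"
```

After this pass, an exact comparison finds the duplicate that raw string comparison would have missed.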
The implementation of new processes and procedures built on the DataFlux solution revealed a number of valuable lessons:
- Before doing anything, know your data. The InsightAmerica team used the profiling technology to provide a deep background on the data before attempting any data quality or data integration techniques.
- Break down processes into multiple steps. Creating better data is not as easy as getting from point A to point B. Set milestones and checkpoints for subprojects so that quality improves progressively.
- Push data quality techniques as close to data capture as possible. We learned that fixing problematic data only becomes more complex over time. By finding and eliminating data defects upstream, the rest of the data quality project is simplified.
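The first lesson - know your data - rests on profiling. As a rough sketch of what a profiling pass reports, assuming a simple column of phone strings, the essentials are fill rate, cardinality and format patterns:

```python
# Illustrative column profiling: summarize fill rate, distinct values,
# and value patterns before attempting cleansing or matching.
# The sample phone data is hypothetical.
import re
from collections import Counter

def profile_column(values):
    """Summarize one column of string values."""
    patterns = Counter()
    non_null = 0
    for v in values:
        if v in ("", None):
            continue
        non_null += 1
        # Map digits to 9 and letters to A to reveal format patterns.
        patterns[re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))] += 1
    return {
        "fill_rate": non_null / len(values) if values else 0.0,
        "distinct": len(set(v for v in values if v not in ("", None))),
        "top_patterns": patterns.most_common(3),
    }

phones = ["303-555-0100", "3035550101", "", "303-555-0102"]
print(profile_column(phones))
```

A report like this immediately exposes the empty values and the two competing phone formats, which is exactly the background a team needs before writing standardization or matching rules.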
After utilizing dfPower Studio to intelligently integrate data from the existing data sources, InsightAmerica noticed an improvement in overall data quality. The software allowed the company to apply consistent formatting for names, addresses and organization names. Further, through the use of delivery point validation information from USPS, the company could ensure that addresses within the records were not only standardized but also valid.
On the data integration side, the product's matching functionality delivered immediate benefits. DataFlux matching lets users tune the sensitivity of the matching engine, allowing the team to more confidently uncover and resolve duplicate data across sources.
The increased quality of data also extended to the staff members, who saw a decrease in the time and effort required to support new data feeds. With consistent, accurate and reliable data at the foundation of their efforts, the amount of scrap and rework needed to accommodate bad data dropped dramatically, providing an overall productivity gain for the company.