All of the traditional aspects of data cleansing that apply to on-premise software systems also apply to cloud-based services and applications, but there are additional challenges to take into consideration. The types of questions that arise are:

  • Should we stage the data in an on-premise database?
  • Do we need a cloud-based test environment?
  • What if we can’t customize the software-as-a-service (SaaS) business process?

This article addresses the challenges and many of the common questions that companies have regarding data cleansing in the cloud.
Data cleansing, also referred to as data scrubbing, becomes necessary and critical to preserve data integrity rules when: new systems are integrated with existing systems; a legacy system is retired and the data needs to be extracted and migrated into a new application or system; or a company merger or acquisition occurs and multiple systems require integration to ensure continued business as one entity. These three scenarios become a bit more complex because the integration points can be cloud to cloud or cloud to on premise.

Data cleansing is not a task; it is a process. In any implementation, whether it involves cloud computing or not, the stakeholders of this process are the same.

  • The data cleansing task force is the team dedicated to building the process, bridging the relevant parties and delivering the final cleansing result.
  • The business owner is the ultimate owner of the new application/process or the output of the data cleansing. Business owners need to provide clear data quality requirements from the business perspective.
  • IT professionals take direction from the data cleansing task force and implement technical solutions to help the task force reach the end goal.
  • End users will eventually benefit from the data cleansing effort. They also provide the feedback to the data cleansing task force for further enhancements.

Data Cleansing Cloud-Based Challenges and Solutions

1. Should we stage the data? If so, where?

Data cleansing is detailed and important when bridging different systems. It is important to map out the cleansing process before kicking off the effort. A big part of the process is determining how to move the data from the sources to the targets. Many organizations are moving toward cloud computing so they do not have to maintain infrastructure. In staying with that goal, moving and cleansing data in one step from the source to the target is advantageous. Using connectors or Web services that the SaaS vendors provide, we can pull the data, manipulate it and load it in one process. However, this becomes difficult, if not impossible, if more than one source is involved, complex de-duplication is required, or the data volume exceeds the capacity that the SaaS vendors allow. In those cases, staging the data in an on-premise database may be advised.
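The one-step approach can be sketched as follows. This is a minimal illustration rather than any vendor's API: the names clean_record and migrate_in_one_pass are hypothetical, and a plain list stands in for the connector or Web service on each side.

```python
from typing import Callable, Iterable


def clean_record(record: dict) -> dict:
    """Trim whitespace on string fields and lower-case the email field."""
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }
    if cleaned.get("email"):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def migrate_in_one_pass(source_rows: Iterable[dict],
                        load: Callable[[dict], None]) -> int:
    """Pull, clean and load each record without staging; return rows loaded."""
    count = 0
    for row in source_rows:
        load(clean_record(row))
        count += 1
    return count


# Stand-ins (assumptions): a list plays the source connector,
# and list.append plays the target's load call.
source = [{"name": "  Ada Lovelace ", "email": "ADA@Example.com"}]
target = []
loaded = migrate_in_one_pass(source, target.append)
```

Because each record is handled on its own, this pattern has no place to compare records across sources, which is why complex de-duplication usually calls for a staging area.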

There are alternatives that allow us to stage the data but still use cloud computing. One example is Microsoft SQL Azure, a database platform in the cloud that allows developers to store data any way they want. In other words, we are not tied to prebuilt software or a set data model. This type of solution can address the data migration and cleansing challenges associated with multiple sources, complex data cleansing rules and data volume limitations.
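To show why a staging database helps with multiple sources, here is a small sketch of cross-source de-duplication. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the staging platform; the table name, columns and sample rows are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite is a stand-in for the staging database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_contacts (source TEXT, name TEXT, email TEXT)"
)

# Rows pulled from two hypothetical source systems.
crm_rows = [("crm", "Ada Lovelace", "ada@example.com")]
erp_rows = [
    ("erp", "A. Lovelace", "ADA@EXAMPLE.COM"),
    ("erp", "Grace Hopper", "grace@example.com"),
]
conn.executemany(
    "INSERT INTO staging_contacts VALUES (?, ?, ?)", crm_rows + erp_rows
)

# Keep one row per normalized email, preferring the earliest-staged row.
deduplicated = conn.execute(
    """
    SELECT source, name, LOWER(TRIM(email)) AS email
    FROM staging_contacts
    WHERE rowid IN (
        SELECT MIN(rowid)
        FROM staging_contacts
        GROUP BY LOWER(TRIM(email))
    )
    ORDER BY rowid
    """
).fetchall()
conn.close()
```

The "earliest row wins" rule is only one possible survivorship policy; a real project would agree on the rule with the business owner.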

2. Do I need a test environment for my SaaS application?

The answer is yes. Some cloud-based vendors offer test environments as part of the license fee or separately. If an organization requires a large data migration or if they have significant data quality issues, they should invest in one or more test environments.

The advantage of a test environment is the ability to proactively uncover and deal with exception data. Handling exception data is part of the data cleansing effort, and its lifecycle is: detect the issue, resolve the issue and reupload the data. Resolving exceptions is an ongoing way to improve the maturity of the data.
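The detect, resolve and reupload lifecycle can be sketched in a few lines. The two validation rules and the field names below are assumptions chosen for illustration; real rules come from the business owner's data quality requirements.

```python
def validate(record: dict) -> list:
    """Return the rule violations for a record; an empty list means clean."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    if "@" not in (record.get("email") or ""):
        problems.append("invalid email")
    return problems


def split_exceptions(records: list) -> tuple:
    """Detect: separate clean records from exception records."""
    clean, exceptions = [], []
    for record in records:
        problems = validate(record)
        if problems:
            exceptions.append({"record": record, "problems": problems})
        else:
            clean.append(record)
    return clean, exceptions


batch = [
    {"id": "1001", "email": "ada@example.com"},
    {"id": "", "email": "grace.example.com"},
]
clean, exceptions = split_exceptions(batch)

# Resolve: someone corrects the flagged record. Reupload: the corrected
# record goes through the same validation before it is loaded again.
corrected = {"id": "1002", "email": "grace@example.com"}
ready_for_reupload, still_failing = split_exceptions([corrected])
```

Keeping the list of problems with each exception record gives the task force a simple report to work through in the test environment.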

There are several ways to test, considering that for SaaS applications direct access to the database is rare to nonexistent. For applications that have reporting functionality, reports can be run to validate the accuracy of the data and the data migration process. Alternatively, the data cleansing task force can use the data export functionality from the application or from the Web services that the vendor supplies.
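When the application offers a data export, one way to validate a migration is to reconcile the export against the source extract. The sketch below assumes both sides have already been parsed into lists of dictionaries with a shared key field; the function name reconcile and the sample fields are hypothetical.

```python
def reconcile(source_rows: list, exported_rows: list, key: str = "id") -> dict:
    """Compare a source extract with an export from the target application."""
    source_by_key = {row[key]: row for row in source_rows}
    exported_by_key = {row[key]: row for row in exported_rows}

    source_keys = source_by_key.keys()
    exported_keys = exported_by_key.keys()
    return {
        # In the source but never arrived in the target.
        "missing": sorted(source_keys - exported_keys),
        # In the target but not in the source.
        "unexpected": sorted(exported_keys - source_keys),
        # Present on both sides but with differing field values.
        "mismatched": sorted(
            k for k in source_keys & exported_keys
            if source_by_key[k] != exported_by_key[k]
        ),
    }


source_extract = [
    {"id": 1, "name": "Alpha"},
    {"id": 2, "name": "Beta"},
    {"id": 3, "name": "Gamma"},
]
target_export = [
    {"id": 1, "name": "Alpha"},
    {"id": 2, "name": "beta"},
    {"id": 4, "name": "Delta"},
]
report = reconcile(source_extract, target_export)
```

In practice the source rows would first be passed through the same transformations as the migration, so that expected differences are not reported as mismatches.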

3. We can’t customize our SaaS business process.

Applications in the cloud may allow little customization of field names and process terminology, and this may not match with a company’s existing business vocabulary. When the company needs to adopt the cloud product’s terminology, building a data dictionary that clearly defines and maps each term is critical for the usability of the application. The data dictionary should also strive to create and document conventions, standards and best practices.

Why is creating a data dictionary important to data cleansing and migrations? It ensures that the source-to-target mappings are correct. It also prepares the business in advance for possible changes to its processes.

A great example comes from the authors’ company. We implemented a SaaS resource management system. The way this system allocates staff to projects and the associated vocabulary (field names, labels, menu items, etc.) differ from our original system, and the new resource management system is not customizable. We also use Salesforce.com. We perform ongoing data integration, and therefore cleansing, between these two systems for opportunity and project information. The vocabularies do not match, which can be confusing for developers and business users of both systems. We developed a data dictionary and other mapping files to map the vocabulary, process and unique identifiers between the two systems.
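A data dictionary of this kind can also drive the integration itself. The sketch below is illustrative only: the field names on both sides are hypothetical and are not the actual names used in Salesforce.com or in the resource management system.

```python
# Each entry maps a source term to the target term and records the
# agreed business definition. All names here are hypothetical.
DATA_DICTIONARY = {
    "opportunity_name": {
        "target": "project_title",
        "definition": "Name of the engagement as agreed with the client.",
    },
    "account": {
        "target": "client",
        "definition": "Organization that is billed for the work.",
    },
    "owner": {
        "target": "engagement_manager",
        "definition": "Person accountable for delivery.",
    },
}


def translate(record: dict, dictionary: dict) -> dict:
    """Rename a record's fields to the target vocabulary.

    Raises KeyError when a field has no mapping, so that gaps in the
    dictionary are caught instead of being silently dropped.
    """
    unmapped = sorted(set(record) - set(dictionary))
    if unmapped:
        raise KeyError("No mapping defined for: " + ", ".join(unmapped))
    return {dictionary[field]["target"]: value
            for field, value in record.items()}


opportunity = {"opportunity_name": "Data Quality Assessment", "account": "Acme"}
project = translate(opportunity, DATA_DICTIONARY)
```

Failing loudly on an unmapped field is a deliberate choice in this sketch: it forces the dictionary to stay complete as either system changes.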

4. When scheduling my data cleansing and migration jobs, what timing issues should I consider?

Data migration for data cleansing should occur during times when business is slow and the end user is least impacted. SaaS vendors typically have planned maintenance windows on weekends and off hours. Plan data migrations around these windows to ensure the system is up and your own integration jobs will run as scheduled. Make sure you are aware of their planned maintenance schedule and create a contingency plan in case they have emergency unscheduled maintenance.
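Planning around maintenance windows can be automated with a simple overlap check. This sketch assumes the vendor's windows are known in advance and entered by hand as start and end times; the function names and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b) -> bool:
    """True when two half-open intervals [start, end) share any time."""
    return start_a < end_b and start_b < end_a


def next_safe_start(job_start, job_duration, maintenance_windows):
    """Push the job start past any maintenance window it would collide with."""
    start = job_start
    # Checking windows in chronological order means a job that is pushed
    # out of one window is still tested against every later window.
    for window_start, window_end in sorted(maintenance_windows):
        if overlaps(start, start + job_duration, window_start, window_end):
            start = window_end
    return start


# Hypothetical vendor window: Saturday 22:00 to Sunday 02:00.
windows = [(datetime(2024, 1, 6, 22, 0), datetime(2024, 1, 7, 2, 0))]

colliding = next_safe_start(
    datetime(2024, 1, 6, 23, 0), timedelta(hours=2), windows
)
unaffected = next_safe_start(
    datetime(2024, 1, 6, 18, 0), timedelta(hours=2), windows
)
```

A check like this covers planned maintenance only; emergency outages still need the contingency plan described above, such as retrying the job or alerting the task force.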

Data cleansing where cloud computing-based systems are the source and/or the target is very much like traditional data cleansing. The data will be cleansed, consolidated, transformed and migrated into the format of the new system. The fact that the source and/or target system is in the cloud does not necessarily impact the cleansing process itself, but there are additional things to consider. Cloud computing is still in the early adoption stages, and new products are being developed to facilitate easier data access and improved data quality. Finally, product vendors in the data integration and quality space are offering solutions for the most popular SaaS applications. As cloud computing continues to evolve, the data integration products will expand to more general use cases.
