Continue in 2 seconds

Data Modeling, Data Warehousing and Data Mining

  • November 01 1998, 1:00am EST

Every day, massive amounts of data pour into your business--sales transactions, inventory figures, billing information, marketing campaign responses and much more. In fact, more data arrives at your business each day than anyone can be expected to look at, let alone successfully use for decision making. Operational applications that run your day-to-day business processes also generate a tremendous amount of data that can provide added value when used as input for decision making. But most organizations suffer from an overabundance of redundant and inconsistent data that is difficult to manage effectively and use for decision making. The reality is that effective decision making relies on information that can originate from end users and data professionals alike. To ensure the quality of that data is, of course, essential. Without data quality you have nothing to rely upon. You can only ensure data quality with a foundation of sound data modeling. Once you've completed your data modeling to build a reliable data warehouse, you can start to mine the data for new market opportunities. And that's where the excitement begins.

Traditionally, the application development and execution environments were separate, requiring separate administration teams with separate controls and versions of data assets. As a result, the data assets used to build application systems were not guaranteed to be in sync with the operational databases used to run the business, leading to problems as the applications were deployed. Add in the rapid acceptance and implementation of data warehouses, and this synchronization becomes even more complex and more difficult to manage.

In this new, complex environment, the data professional now has three different domains to manage--the development environment, the execution environment and the data warehouse--a classic "three-body problem." If you've ever played billiards, a three-body problem is the definition of how the balls bounce across the table. As soon as you put three balls out there and hit one of them, it's an unsolvable problem in terms of where they're going to end up. Those three domains have to coordinate themselves; but like three billiard balls on the table, as soon as one moves and strikes the others, what happens? There's the possibility of chaos.

So how should you manage the chaos? As a data professional, you need to establish a data management architecture that allows for the maximum reuse of data as it moves through its life cycle. The architecture should encompass the creation of the business model (usually referred to as a conceptual data model), transformation of that conceptual model into a relational model and then the transformation into a physical model that can be optimized for your target database environment. This data management strategy works well for building the operational systems that run most businesses today.

Databases continue to become increasingly complex as they provide the basis not only for line-of-business applications, but also for searching for new market opportunities as the global marketplace becomes more competitive. Now, more than ever before, you need to understand the structure of the data your enterprise uses in order to model, design and maintain these complex databases.

Data analysts need help from end users and executives to enable them to understand their business data and to design the data that will be most useful in managing the organization. Analysts need clear ways of diagramming the data and easy ways to validate their models with the vast spectrum of users. Entity-relationship (ER) diagrams have traditionally been an essential part of the overall planning and maintenance of an organization's information resources, but recently new techniques of data modeling (star schema and snowflake) have been introduced to support data warehouse design.

The results of traditional ER data modeling still tend to be directed at operational databases; that is, those that are usually performance sensitive and must support high volumes of transactions. These databases are typically very specialized and are designed to support a very specific set of application requirements. Currency of the data in these databases is much more important than keeping historical data, and the data is continuously changing. The data for operational applications is typically dispersed throughout an enterprise, making it difficult to manage and avoid redundancies. But once a database contains data, it becomes a potential source of information for the data warehouse and for subsequent data mining activities. So why has this data become so important? Well, as businesses have flattened their organizations and pushed decision making down to lower levels, not all decision-makers have access to all of the data needed to make effective decisions. Add in the possibility of redundant data spread over many areas, and the resulting business decisions can be disastrous.

Businesses have also discovered the value of building data warehouses. This new mega-database enables end users and executives to make decisions based on data that was previously unavailable to them in a single place. For the data professional, the newest challenge is to design an optimized relational database that satisfies a much different set of requirements. But the overall approach isn't that different from the operational world you've been familiar with.

Who says that modeling in a data warehouse environment is quite different? You can still use traditional ER modeling tools to create the logical schema (many vendor tools provide this function), or you can use a more specialized approach such as star schema or snowflake (which a few vendors explicitly support today). Why is everyone talking about star schema modeling? Because it has numerous benefits:

* Star schemas are easy to understand.

* Hierarchies are easy to define.

* It can reduce the number of physical table joins.

* There is low maintenance.

* The meta data is very simple.

As with all modeling methodologies, there are a few drawbacks. For example, having summary data in the fact table typically yields poor performance for the summary levels, and huge dimension tables can be both a performance and management problem.

You need to remember that warehouse databases are usually subject-oriented and are not expected to support high volumes of transactions. They represent a cross-enterprise view of the "best" data and are often called enterprise databases. Currency of this data is important, but the historical information is significantly more important. Data warehousing enables you to transform data into useful and reliable information to support business decision making. Data mining then uses the historical information in the data warehouse to identify new market opportunities.

Data mining allows businesses to go into their data--all of their data--and extract significant, previously unknown pieces of information from large databases and then use that information to make important business decisions.

A number of factors have brought data mining to the attention of the business community:

* General recognition of the untapped value in large databases.

* Consolidation of database records tending toward a single customer view.

* Consolidation of databases, including the concept of a data warehouse.

* Reduction in the cost of data storage and processing, which facilitates collection and accumulation of detailed data.

* Intense competition for customer attention in an increasingly saturated marketplace.

* Movement toward the de-massification of business practices.

De-massification is a term from Alvin Toffler. During the industrial revolution, economies of scale led businesses to mass manufacturing, mass marketing and mass advertising. The information revolution is providing the capability to custom manufacture and to market and advertise to small segments and ultimately to the individual consumer. Businesses are now being forced to understand their customers, including their demographics, purchasing patterns and much more. The understanding techniques that can be employed include:

* Finding the answer to standard questions.

* Finding the answer to new questions, or "hypothesis testing."

* Mining the database for patterns, rules or functions.

Most business operations today are supported by regular reports produced by canned database queries. These queries are crafted interactively with the end user. The reports are built to answer a standard set of questions, and they form the basis of most executive decision making today.

A more unstructured approach to understanding information about customers consists of ad hoc database queries. For example, an analyst may formulate the hypothesis that "more short sleeve shirts are sold in California than in Maine." To prove (or disprove) this simple hypothesis, the analyst must formulate a database query, get a report back and check the results.

A more complex form of hypothesis testing is regression modeling. This form of standard statistics always begins with a hypothesis, such as "the sales of short sleeve shirts in Maine increases linearly with the average outside temperature." Then, a model is built and tested to see how well the data fits the hypothesis. In reality, the analyst attempts to disprove the null hypothesis. Finally, the technique of predictive modeling has been extremely effective, since it does not require that the data modeler come up with the idea. The modeler merely hypothesizes the form and shape of the relationship between data.

Data mining is a useful set of tools and another approach that combines discovery with analysis. Data mining is not a newly discovered branch of mathematics embodied in software that will, when hooked-up to a large and problematical database, inexplicably and inevitably reveal the business insights contained in its millions of records. Yet it is important and will become increasingly important in helping businesses to remain competitive.

Data mining is the process of providing an automated presentation of patterns, rules or functions to a knowledgeable user for review and examination. The data analyst plays an essential role in the process because it is only the analyst who can decide whether a pattern, rule or function is interesting, relevant and useful to the enterprise.

While data mining techniques need to be tailored to specific business needs, they still apply across industry and organizational boundaries. As we all suspect, the two key factors that most greatly influence the results of data mining projects are the quality and accuracy of the customer data originally entered into a database.

In a sense, data mining can be reduced to the imperative "show me" applied to a database. There are two approaches, one less interactive than the other. The first approach is to start a program, let it identify patterns, rules or functions, and then have the analyst review them for value. The second, more interactive approach is called exploratory data analysis. Here the analyst says show me the data in a certain way, views it, transforms it and reviews it. The analyst goes back and forth, exploring the relationships displayed often with unique methods of visualization. In a hybrid approach to data mining, a program can run, identify patterns and then feed into an exploratory tool where the results are visualized. Insights gained can then be used to feed back and modify the original program's operation, establishing a cycle where the problem is more deeply unraveled each time through it.

IBM's advanced data mining approach is based on leveraging information by using a range of tools including query and reporting, data interpretation, multidimensional analysis and knowledge discovery. These tools support both verification-driven data mining, in which simple questions are expressed in SQL, the standard relational database language and discovery-driven data mining, which actually suggests a hypothesis for the more sophisticated and subtle queries.

As businesses restructure and change to become more competitive, whether through reengineering, downsizing or streamlining, the need for removing inefficiencies and leveraging key assets becomes even more critical to success. Turning organizational data into an information asset is key to gaining competitive advantage. Control of information assets is required from the initial conceptual modeling steps through the implementation and deployment of data, as well as in the management of operational and decision support applications. Data modeling should always be the first step.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access