All our knowledge brings us nearer to our ignorance, . . .
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

The Rock [1934], T.S. Eliot, 1888-1965

From the Enigma code-breaker to the Pentium processor, from databases to knowledge-based systems, and from hypertext to the World Wide Web, there has always been a single, common, unifying thread. All of these technologies have been developed progressively to address one agenda, to solve one overwhelming need--the need to discover, manage and organize data into useful information.

Most articles on knowledge management and data warehousing focus on business knowledge. We are applying data warehouse technology, in the form of knowledge discovery, to biotechnology and genetic engineering. Just as business has borrowed concepts from computer science and artificial intelligence, genetic engineering and biotechnology may, in turn, learn a thing or two from business applications.

Along with the Human Genome Project, efforts are underway to develop databases for the common white mouse, for various agricultural plants and for bacteria such as E. coli. These efforts generate significant amounts of data. Using the data from these sources, researchers, pharmaceutical companies and venture capitalists are hoping for breakthroughs that lead to specialized treatments and drugs. Plans that appear near-term range from a cure for cancer to a better-tasting tomato, and all of them seem within reach of the genetics part of the technology. Storing genetic data is relatively simple; extracting useful information and interpreting it is the hard part. So far, no killer application has emerged to turn this data collectively into useful information. Knowledge discovery may change all that.

Knowledge discovery is the critical function and killer application needed to make biotechnology the success story of the next decade, as information technology was of the previous one. One goal is to create a genetic data warehouse template that can be used for knowledge discovery in the medical, pharmaceutical and agricultural industries.

Knowledge discovery, as a kind of data mining, can use artificial intelligence, machine learning and neural network techniques to discover new relationships by sifting through an ocean of data to distill fresh answers to unstated questions. The insights from these discovered relationships and trends can result in quantum leaps in understanding the information hidden in the database.

Ideally, we would like to build a data warehouse which contains DNA sequences, genes, their locations and functions, when they are active and their metabolic environment. Then we will define the meta data, so that biotech researchers might use the genetic data warehouse for knowledge discovery.

Knowledge Discovery

Knowledge discovery, also known as data discovery, can use data warehouse technology as a starting point. One of the differences between business-based data warehouses and genetics-based data warehouses is that business data tends to be better defined and much more complete than genetics data. In fact, the scarcity of complete genetic information is the reason that biotechnology is a growing field. Molecular biologists recognize the opportunity to extract valuable knowledge from the existing genetics databases. They are now building genomic databases through projects like the Human Genome Project and other less publicized projects.

The first goal of these projects is to build a complete database of the DNA sequence of a human, animal or plant. These efforts are complicated by the fact that there is no one DNA molecule. Your DNA differs from mine by a small percentage, which is why you are different from me. However, with hundreds of millions of base pairs in a DNA molecule, a small percentage adds up to a large absolute number. So your DNA may differ from mine by a million base pairs. So whose DNA do they store in the Human Genome Database? Yours or mine? One approach is to use some average and apply a general search algorithm to match across specific examples.
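As a rough illustration of the "average" idea, the sketch below builds a consensus sequence from a few already-aligned samples by taking the most common base at each position, then counts how far each sample is from that consensus. A consensus is only one plausible reading of "some average"; it is not necessarily what the genome projects store. The sample sequences and the function names `consensus` and `mismatches` are invented for illustration.

```python
from collections import Counter


def consensus(samples):
    """Return the most common base at each position of aligned, equal-length samples."""
    if len({len(s) for s in samples}) != 1:
        raise ValueError("samples must be non-empty and aligned to the same length")
    # zip(*samples) walks the samples column by column (one position at a time).
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*samples))


def mismatches(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical aligned samples from three individuals (illustrative data only).
samples = ["ACGTACGT", "ACGTTCGT", "ACCTACGT"]
reference = consensus(samples)
differences = [mismatches(reference, s) for s in samples]
```

Each individual is then described by its differences from the shared reference rather than by a full private copy, which is what makes a later approximate search across specific examples tractable.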

Because the information is incomplete, the database is not where all the interesting opportunities are. It is like having a directory to the homes of the movie stars, with the addresses and phone numbers, but without the names. The opportunities in a genetics database lie in discovering new knowledge about how genes work. The current databases store only what the genes are, not their functions. For the functional knowledge, we want to create a data warehouse that includes other parameters along with the genetic database. And as complex as the genetic database is, the surrounding parameters are orders of magnitude more complicated. We plan to build a knowledge discovery data warehouse to collect the new knowledge about these parameters. The steps used to build a genetics knowledge discovery data warehouse (an exploratory data warehouse) can easily be reused for other applications.

Biotech companies, pharmaceutical companies and agricultural product companies are counting on the economic promise that will arise from mapping the human genome and the genome of other organisms. Identification of genetic patterns will provide molecular biologists with the information to treat illnesses, to create drugs and to grow new kinds of plants and food. Imagine being able to use genetic information to design a genetic treatment for cancer or heart disease that is specific to you. Or growing a potato or banana that you eat in order to gain immunity from a disease, rather than be vaccinated. Genetic knowledge discovery will support these developments and others. With these goals in mind, we have been using knowledge management techniques to create a general, genetic data warehouse model which can be used for knowledge discovery for medicine, pharmaceuticals and agriculture.

Important Gaps

To provide meaningful results, the knowledge discovery software needs extensive and detailed data. Because there are gaps in our knowledge of molecular biology, genetic data and the resulting databases are currently incomplete. In a typical database, gaps may not be important and might not even be modeled. But our conceptual leap has been to explicitly model all known gaps in the data and in the knowledge, as well as the known data and knowledge.
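One way to picture explicitly modeled gaps is a record in which every attribute carries a status, so that "we know this exists but have no value" is stored as data rather than silently omitted. The sketch below is a minimal illustration under that assumption; the class names, the three status values and the sample record are hypothetical and are not taken from the authors' model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    KNOWN = "known"          # value has been observed
    PREDICTED = "predicted"  # value is inferred, not yet confirmed
    GAP = "gap"              # attribute is known to exist, value is missing


@dataclass(frozen=True)
class Attribute:
    name: str
    status: Status
    value: Optional[str] = None


@dataclass
class GeneRecord:
    gene_id: str
    attributes: list

    def gaps(self):
        """Names of the attributes explicitly recorded as missing."""
        return [a.name for a in self.attributes if a.status is Status.GAP]


# Hypothetical record: identifiers and values are placeholders, not real data.
record = GeneRecord(
    gene_id="GENE-0001",
    attributes=[
        Attribute("sequence", Status.KNOWN, "ACGTACGT"),
        Attribute("location", Status.KNOWN, "region-1"),
        Attribute("function", Status.GAP),
        Attribute("active_when", Status.GAP),
        Attribute("metabolic_environment", Status.PREDICTED, "aerobic"),
    ],
)
missing = record.gaps()
```

Because the gaps are first-class values, a discovery routine can query them directly, for example to list every gene whose function is still unknown.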

Our catalyst for this knowledge representation was a logic puzzle called "Who Owns the Zebra?", which uses ten statements to derive 23 facts. But none of these facts mentions the zebra. The way to discover who owns the zebra is to fill in the 23 facts, then observe what is missing. As a first approximation, this logic puzzle is similar to the questions that molecular biologists are asking of genetic databases.

However, the analogy breaks down quickly. Rather than 23 facts, there are more than one hundred thousand genes. And your genes are different from my genes. What makes each of us unique is just that previously mentioned small percentage difference, but that difference makes genetic pattern matching more difficult than a simple database lookup.

A molecular biologist studies some genetic material that is a few hundred thousand DNA units in length. She wants to compare her sample to a known sequence of DNA material. This stretch of material may contain the gene for controlling a cyst or curing cancer, or it may just be noise. How can a researcher determine whether a similar sequence has ever been studied, especially if similar sequences are rarely exact matches? Our proposed solution is the use of a fuzzy pattern match. There are many ways to do fuzzy matches, ranging from simple wildcard searches to complex fuzzy logic routines, with many methods in between.
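As one concrete example of a fuzzy match, the sketch below finds the best approximate occurrence of a short pattern inside a longer sequence, allowing substitutions, insertions and deletions. It uses a standard edit-distance dynamic program in which a match may start anywhere in the text; this is a swapped-in textbook technique chosen for illustration, not the routine the authors designed, and the function name and sequences are made up.

```python
def best_fuzzy_match(pattern, text):
    """Return (edit_distance, end_index) of the best approximate occurrence
    of pattern inside text; end_index is exclusive."""
    # prev[j]: fewest edits to match the pattern prefix seen so far against
    # some substring of text that ends at position j.
    prev = [0] * (len(text) + 1)          # an empty pattern matches anywhere for free
    for i in range(1, len(pattern) + 1):
        curr = [i] + [0] * len(text)      # i pattern characters against no text
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,       # match or substitute
                prev[j] + 1,              # skip a pattern character
                curr[j - 1] + 1,          # skip a text character
            )
        prev = curr
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end


# Hypothetical sample and stored sequence (illustrative data only).
distance, end = best_fuzzy_match("ACGT", "TTTACCTGGG")
```

The running time grows with the product of the two lengths, so at the scale of a few hundred thousand DNA units this direct approach would need an index or a cheaper pre-filter in front of it.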

One of the in-between methods relates to how a spelling checker works. The old spelling checkers used a method called the Soundex algorithm that mapped sounds, like a hard "k", to letter combinations such as k, c, ch and qu. The checker maps the letters of a misspelled word to various combinations of letters with the same sound value, matches those to a list of words in a dictionary, then provides the resulting list back to the user. We are designing a similar routine for genetic pattern matching and creating the genetic equivalent of letter combinations to map into the various DNA sequences.
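To make the analogy concrete, the sketch below implements the classic American Soundex code, which groups similar-sounding consonants under one digit, and then shows one possible DNA counterpart. The DNA mapping is purely an assumption for illustration: it collapses bases into the two standard chemical classes, purines (A, G, written R) and pyrimidines (C, T, written Y). The article does not say which groupings the authors use, and `dna_key` is a hypothetical name.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word):
    """Return the four-character American Soundex code of a word."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    last = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":        # H and W do not break a run of equal codes
            last = code
    return (result + "000")[:4]


# Assumed DNA analogue: purines (A, G) become R, pyrimidines (C, T) become Y.
BASE_CLASS = str.maketrans("AGCT", "RRYY")


def dna_key(sequence):
    """Reduce a DNA sequence to its purine/pyrimidine pattern."""
    return sequence.upper().translate(BASE_CLASS)


same_sound = soundex("Robert") == soundex("Rupert")
same_pattern = dna_key("ACGTTA") == dna_key("GCATTG")
```

Two sequences with the same key become candidates for a closer comparison, in the same way that words with the same Soundex code become spelling suggestions.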

Implementation Plans

We see two steps to this process. Ideally, these steps are integrated, but we've separated them for ease of description and planning. The first step is building the data warehouse model and the second step is building the knowledge discovery model.

Building the Data Warehouse Model

There are seven steps to building a biotechnology data warehouse for genetic mapping and functional discovery.

  1. Logical Data Model: What are the known and predicted data elements?
  2. Functional Modeling: What are the known genetic functions?
  3. State Modeling: What are the known metabolic, chemical and genetic environments or states?
  4. Table Designs: What are reasonable designs for relational tables?
  5. Scientific Model: What are the scientific (or business) goals and relationships?
  6. Table Entities: What are the many-to-many relationships? Rationalize to create entities.
  7. Re-normalize: Normalize all data, relationships and models.

Most of these steps are fairly straightforward modeling exercises. Table entities and re-normalization are activities that may have other names. Table entities can serve as meta data, or unifying information about the data. Normalization eliminates duplications in the data. The intent of the re-normalization process is to eliminate and rationalize duplications and redundancies across models, relationships and tables. Overlaps, redundancies, and many-to-many mappings sometimes provide insight into ways to provide new views of the data. Of course, sometimes they may also reveal sloppy data management.

When creating the logical data model for the data warehouse, we used the third normal form: the key, the whole key and nothing but the key.

The Key: In a given table, each column and all its data relate to the primary key.

The Whole Key: The columns and data apply to the entire key, not just part of it.

Nothing but the Key: The columns and data relate to the primary key and not any other columns in the table.

This provides the best relational organization of the data, so that users can ask any question about all the data in the data warehouse. This unbounded flexibility is a prime requirement of knowledge discovery in biotechnology because the field is still not sufficiently mature to know what questions to ask. After the flexible data warehouse is designed, it is fine-tuned to encourage knowledge discovery.
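A small schema can show what the three rules look like in practice. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The tables, columns and rows are hypothetical stand-ins, not the authors' model: gene facts depend only on the gene key, activity depends on the whole (gene, environment) key, and descriptive text lives only in the table whose key it describes. NULL is used here as one simple way to mark a known gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    symbol      TEXT NOT NULL,
    location    TEXT                 -- NULL marks a known gap
);
CREATE TABLE environment (
    environment_id INTEGER PRIMARY KEY,
    description    TEXT NOT NULL
);
-- Composite key: activity depends on the whole (gene, environment) pair.
CREATE TABLE gene_activity (
    gene_id        INTEGER NOT NULL REFERENCES gene(gene_id),
    environment_id INTEGER NOT NULL REFERENCES environment(environment_id),
    is_active      INTEGER,          -- NULL marks a known gap
    PRIMARY KEY (gene_id, environment_id)
);
""")

# Placeholder rows (illustrative data only).
conn.execute("INSERT INTO organism VALUES (1, 'E. coli')")
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)",
                 [(1, 1, "geneA", "region-1"), (2, 1, "geneB", None)])
conn.executemany("INSERT INTO environment VALUES (?, ?)",
                 [(1, "aerobic"), (2, "anaerobic")])
conn.executemany("INSERT INTO gene_activity VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 2, None), (2, 1, 0)])

# An ad hoc question nobody planned for: where is activity still unknown?
unknown_activity = conn.execute("""
    SELECT g.symbol, e.description
    FROM gene_activity a
    JOIN gene g ON g.gene_id = a.gene_id
    JOIN environment e ON e.environment_id = a.environment_id
    WHERE a.is_active IS NULL
    ORDER BY g.symbol, e.description
""").fetchall()
```

Because no fact is stored in more than one place, new questions can be answered by joining tables rather than by redesigning them.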

Building the Knowledge Discovery Model

There are five steps to knowledge discovery.

  1. Selection: Create possible segmentation criteria for selecting data.
  2. Preprocessing: Normalize, rationalize and cleanse the data.
  3. Transformation: Create general representation and tables of meta data.
  4. Extraction: Extract patterns from the data warehouse, turning data into knowledge.
  5. Interpretation and Evaluation: Evaluate the utility of extracted and identified patterns.

The first three steps (selection, preprocessing and transformation) were done during the data warehouse stage. The fourth step, the extraction of patterns, is performed "automagically," applying AI or statistical techniques based on initial patterns created from the meta data. Clearly the meta data and knowledge representation activities are important. The interpretation of the results, the fifth step, can be partially automated based on predefined criteria; but, for the most part, it is a manual evaluation process. After initial evaluation, we will then iterate through the process with better defined pattern match criteria. If we could completely define the evaluation criteria, then the process would simply be a database lookup. But we can't yet define the nuggets of knowledge that we're looking for.
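The extract, evaluate and iterate loop can be sketched in a few lines. In the illustration below, extraction counts how many sequences contain each short substring, evaluation keeps the substrings shared by every sequence, and each iteration tightens the criterion by asking for a longer pattern. All of this is an assumed stand-in for the AI or statistical techniques mentioned above; the function names, the choice of substring length as the criterion and the sequences are hypothetical.

```python
from collections import Counter


def extract_patterns(sequences, k):
    """Extraction: count in how many sequences each length-k substring occurs."""
    support = Counter()
    for seq in sequences:
        # A set, so a substring repeated within one sequence is counted once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def evaluate(support, min_support):
    """Evaluation: keep patterns found in at least min_support sequences."""
    return sorted(p for p, n in support.items() if n >= min_support)


def discover(sequences, k=3, max_k=8):
    """Iterate with a stricter criterion (a longer pattern) until nothing survives."""
    best = []
    while k <= max_k:
        shared = evaluate(extract_patterns(sequences, k), len(sequences))
        if not shared:
            break
        best, k = shared, k + 1
    return best


# Hypothetical sequences (illustrative data only).
sequences = ["TTGACGTAA", "CCGACGTGG", "AGACGTTCA"]
patterns = discover(sequences)
```

In a real setting the evaluation step is where the manual judgment described above comes in; the automated criterion only narrows what a person has to look at.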

Conclusions

This has been a brief description of our efforts toward knowledge discovery within genetic information. We have defined a general knowledge representation architecture. We have designed the first draft of our logical data warehouse model for DNA sequences. We have also designed the first pass at a genetic pattern match routine.

We want our knowledge representation architecture to remain general to address questions that we don't yet know how to ask. The logical data warehouse model with the explicit gaps in the data elements supports the general architecture. In addition to the genetic data model, we will model the gene environment and biochemical parameters. Although this information is incomplete, statistical methods may help fill in the gaps. As we model other parameters, we will iterate the process and use the results to expand the genetic pattern match routine.

We are confident that our logical data warehouse model will provide concrete and useful knowledge. As we iterate our bootstrapped efforts to fill in some of the gaps, we expect to expand our knowledge discovery capabilities. Our preliminary results are consistent and look very promising. So we have confidence that our approach will lead to the discovery of the "genetic knowledge lost in all the information."
