Transforming Text and Data Into a True Knowledge Base

Extensive Google searches to locate current statistics on the size and growth rates of big data failed me today. I then realized why. Growth rates are so large and so dynamic that few, if any, are attempting to predict how much data is really out there. In our multichannel world, there’s simply too much data to digest. The variety of data, the growth in unstructured data and the challenges of deciphering it prevent transformation from noise to meaning, and no industry is immune from this monumental shift in data complexity.

There are some rays of hope. Newer sources of data, including Linked Open Data (LOD), are available. LOD is free of charge, used by few, understood by fewer and powerful enough to distinguish you from the pack. Unstructured data is growing at breakneck speed and organizations are scrambling to keep up, but with the proper technology some are succeeding. These progressives have found a way to transform, connect, organize, query and analyze information to achieve big data enlightenment.

They have “knowledge bases” of integrated data, which provide a roadmap to discovery, decision support, better research, improved customer service, personalized patient care, higher advertising conversion rates, customer retention and more. To relate this to the human mind, we each have our own knowledge base developed from experience, study, training, interactions, relationships and events. We use it every day to make decisions. And our brain, the engine behind the knowledge, has the ability to reason, extend knowledge, learn and draw conclusions. Imagine if your business leveraged a knowledge base replete with all of its dark data. Imagine the types of questions you could answer given instant recall, powerful reasoning and billions of related facts.

While experts may argue about the definition of a “knowledge base,” most agree on these characteristics:

  1. Pre-existing knowledge
  2. The ability to absorb additional unstructured and structured data
  3. Linking capabilities to connect the dots between facts and the original source of information
  4. Reasoning powers to infer new facts or detect inconsistencies
  5. Methods to classify all the information
  6. A storage and maintenance capability to manage the knowledge flow 

Shine a light on dark data

There are many techniques for creating knowledge bases, but one gaining attention is grounded in natural language processing, or text mining. When done correctly, this has the power to structure unstructured data and create a new level of discovery. One option for storing the results from text mining is a knowledge graph, which allows insights to be drawn in milliseconds.

A unique approach in knowledge base creation

Text mining is more than just analyzing documents. A number of processes are involved that blend existing Linked Open Data (LOD) with your own data to create a more complete base of intelligence. For example, let’s say you have a set of data containing the names and addresses of businesses along with other information such as industry, revenue and employees. Like most data sets, it’s probably incomplete. Let’s also suppose you have validated text mining algorithms that run against your own data. They identify and extract other business names, locations and facts. When a company is mentioned in a document, you can check your original reference data to see if it exists. Does it have the exact same spelling? Is it truly the same business but referred to differently? Should the new business be added to the knowledge base as a unique, new entry?

These questions refer to an approach to knowledge base creation that starts with facts, analyzes text, disambiguates meanings, resolves identities and stores additional knowledge with links back to the original text.
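The three questions above (same spelling, same business under a different name, or a new entry) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reference data, the suffix list, the 0.85 threshold and the `normalize` and `resolve` helpers are all hypothetical, and plain string similarity from the standard library stands in for whatever identity resolution a production text mining pipeline would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data: canonical business name -> known facts.
REFERENCE = {
    "Acme Corporation": {"industry": "Manufacturing"},
    "Globex Inc": {"industry": "Energy"},
}

# Legal suffixes ignored during comparison (illustrative, not exhaustive).
SUFFIXES = {"inc", "corp", "corporation", "co", "llc", "ltd"}


def normalize(name):
    """Lowercase the name and drop punctuation and legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def resolve(mention, threshold=0.85):
    """Return (name, status), where status is 'exact', 'variant' or 'new'."""
    target = normalize(mention)
    best_name, best_score = None, 0.0
    for canonical in REFERENCE:
        score = SequenceMatcher(None, target, normalize(canonical)).ratio()
        if score > best_score:
            best_name, best_score = canonical, score
    if best_name is not None and best_score >= threshold:
        # Same business: either spelled identically or referred to differently.
        status = "exact" if mention == best_name else "variant"
        return best_name, status
    # No close match: treat the mention as a candidate new entry.
    return mention, "new"


for mention in ["Acme Corporation", "ACME Corp.", "Initech LLC"]:
    print(mention, "->", resolve(mention))
```

A mention resolved as a variant would be linked to the existing entry, while a new mention would be added along with a pointer back to the document it came from.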

As entities are identified inside free-flowing text, they can be connected to other entities mentioned in the text and the reference data. For example, if the text states “His hotel was in Foggy Bottom located just east of Georgetown and not far from the White House,” text mining can identify entities such as “Foggy Bottom” and “Georgetown.” But are they places, people, organizations or something else? If your existing knowledge base already has pre-loaded facts about places, it can enhance the entities with any number of new insights.

One of the steps in text mining is “relationship identification.” Once entities are identified and enriched, they are connected to other entities; for example, “Foggy Bottom is in Washington, DC,” “Foggy Bottom is near The White House” and “Foggy Bottom is east of Georgetown.” What just happened? We used Linked Open Data (LOD) to verify Foggy Bottom as a neighborhood that exists in Washington, DC while also connecting it to other entities. LOD knows that DC is a “District” (not a state) and that it is within the United States. Preexisting facts were combined with results from text analysis to expand the knowledge base.
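One way to picture this combination is as a set of subject-predicate-object triples, each tagged with where it came from. The sketch below is illustrative only: the predicate names, the `PRELOADED` and `EXTRACTED` collections and the `merge` and `facts_about` helpers are assumptions made for the example, and the extracted relations are written by hand rather than produced by a real text mining step.

```python
# Hypothetical preloaded facts, standing in for a Linked Open Data source.
PRELOADED = {
    ("Foggy Bottom", "type", "Neighborhood"),
    ("Foggy Bottom", "is_in", "Washington, DC"),
    ("Washington, DC", "type", "District"),
    ("Washington, DC", "is_in", "United States"),
}

# The example sentence, kept so every extracted fact can link back to it.
SENTENCE = (
    "His hotel was in Foggy Bottom located just east of Georgetown "
    "and not far from the White House"
)

# Relations a text mining step might extract from that sentence (hand-written).
EXTRACTED = [
    ("Foggy Bottom", "east_of", "Georgetown"),
    ("Foggy Bottom", "near", "The White House"),
]


def merge(preloaded, extracted, source_text):
    """Combine both fact sets into one mapping of triple -> provenance."""
    graph = {triple: "preloaded" for triple in preloaded}
    for triple in extracted:
        # Keep the preloaded provenance if the fact was already known.
        graph.setdefault(triple, source_text)
    return graph


def facts_about(graph, entity):
    """List (predicate, object) pairs whose subject is the given entity."""
    return sorted((p, o) for (s, p, o) in graph if s == entity)


graph = merge(PRELOADED, EXTRACTED, SENTENCE)
for predicate, obj in facts_about(graph, "Foggy Bottom"):
    print("Foggy Bottom", predicate, obj)
```

Storing the source sentence alongside each extracted triple is what keeps the new knowledge linked to the original text.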

The Power of Inference

Suppose the data inside the knowledge base is stored in the form of a connected graph so that entities are forever linked to one another. If the knowledge base has the power to reason, it can infer new facts. For example, if we know Foggy Bottom is in Washington DC, and Washington DC is a subregion of the United States, and the United States is in the Northern Hemisphere, we can infer that Foggy Bottom is in the Northern Hemisphere. The knowledge graph was able to connect the dots, think at the speed of the human mind and instantly reason to yield new information.
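The Foggy Bottom inference is an instance of transitive reasoning: if A is in B and B is in C, then A is in C. A minimal sketch follows, assuming containment facts are stored as simple pairs. The `infer_containment` function is a hypothetical helper that repeats the rule until nothing new appears; a real graph database would apply such rules through its own reasoning engine.

```python
# Containment facts as (inner, outer) pairs, taken from the example above.
FACTS = {
    ("Foggy Bottom", "Washington DC"),
    ("Washington DC", "United States"),
    ("United States", "Northern Hemisphere"),
}


def infer_containment(facts):
    """Return the facts plus every containment implied by transitivity."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        snapshot = list(inferred)
        for inner, middle in snapshot:
            for start, outer in snapshot:
                # Rule: inner in middle, middle in outer => inner in outer.
                if middle == start and (inner, outer) not in inferred:
                    inferred.add((inner, outer))
                    changed = True
    return inferred


closure = infer_containment(FACTS)
print(("Foggy Bottom", "Northern Hemisphere") in closure)
```

None of the three stored facts mentions Foggy Bottom and the Northern Hemisphere together, yet the closure contains that pair.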

Extending the Knowledge Base with Classifications

The benefits of this approach are extensive. Text is discoverable since extracted knowledge is inextricably linked to the original text. Answers to your queries are more complete. Connections are more obvious. Data is presented to you in context, showing relationships between disambiguated entities, and is stored in the same format throughout the knowledge graph, thereby lowering maintenance costs.

One less obvious advantage has to do with entity classification. This addresses the problem of categorization while maintaining relationships. Based on your needs, entities can be categorized using “data dictionaries” that classify the entities. Foggy Bottom is a place. The White House is a building. Washington DC is a District (not a state). When these classifications are bound to the knowledge base, they provide it with a common language and a way to ask additional, broader questions of the underlying knowledge. For example, a query such as “Give me all the districts in the Northern Hemisphere” would return “Washington DC.” These classifications take on many forms of escalating complexity and power: Controlled Vocabularies, Thesauri, Taxonomies and Ontologies are all viable approaches to classification. They can be leveraged not only at the entity level but also at the document level to group documents into their most relevant category.
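The districts query combines a classification lookup with the containment relationships already in the graph. The sketch below assumes two hypothetical dictionaries, `TYPES` as a tiny data dictionary and `IS_IN` as direct containment, plus illustrative helpers `is_within` and `find`. The placement of The White House in Washington DC is added here for the example and is not part of the data described above.

```python
# Hypothetical data dictionary: entity -> classification.
TYPES = {
    "Foggy Bottom": "Place",
    "The White House": "Building",
    "Washington DC": "District",
}

# Hypothetical direct containment: entity -> the region it sits in.
IS_IN = {
    "Foggy Bottom": "Washington DC",
    "The White House": "Washington DC",  # assumed for this example
    "Washington DC": "United States",
    "United States": "Northern Hemisphere",
}


def is_within(entity, region):
    """Walk up the containment chain to see whether entity lies in region."""
    seen = set()
    current = entity
    while current in IS_IN and current not in seen:
        seen.add(current)  # guards against accidental cycles in the data
        current = IS_IN[current]
        if current == region:
            return True
    return False


def find(entity_type, region):
    """Return all entities of the given type located within the region."""
    return sorted(
        entity
        for entity, kind in TYPES.items()
        if kind == entity_type and is_within(entity, region)
    )


print(find("District", "Northern Hemisphere"))
```

The same pattern extends to richer schemes such as taxonomies, where the type lookup itself would walk a hierarchy instead of reading a flat dictionary.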

There’s much more involved in the process of creating a knowledge base, but these fundamentals repeat themselves across domains. Integrated data that has been analyzed and classified holds great potential to resolve mission-critical questions. By interlinking text and data, leveraging public data sources that provide context and supplemental information, and identifying true meaning, businesses are constructing knowledge bases with the power to reason at the speed of the human mind. It is these large graphs of connected knowledge that are addressing decision support challenges worldwide.

Tony Agresta is the managing director of Ontotext USA, which addresses semantic technology challenges using text mining and graph databases.
