JAN 31, 2008 3:02am ET

The Problems of Megadata Searches

Imagine that you are newly arrived on planet Earth and faced with going to a public library. You are completely unaware of how a library works, but you understand books themselves. You enter and start to look at every single book until you find one on your chosen subject - possibly something along the lines of Social Etiquette for Earth Visitors. Having found this book, you go away, read it and return the next day to find another book on the same subject. You start again from the very beginning, looking at every book (including all the ones you looked at yesterday and found no interest in), until you find another book on the same subject. You patiently continue to do this, day after day, until someone offers to help you.

Why am I harping on about such a case? Well, it seems to me that this is how the majority of companies deal with their own data. When a report is required, it is run against the complete database, often day after day, repeating the same searches it has done many times before.

In the library, the person who will help is the librarian - a person who has built up a wealth of knowledge and has multiple means at their disposal to help identify and locate what you need. In the data center, the equivalent help has tended to come from relational databases with multiple indices, running on fast and expensive hardware platforms.

Increasingly, however, massive databases on expensive hardware don’t seem to be enough. Running reports against such megadata stores is still slow, with some reports taking many hours or even days to complete. Set against this is the need for organizations to be far more fleet of foot, responding to changes in the market in near real time.

Unfortunately, the continuing growth in data quantity, often combined with a lowering of data quality, does not fit well with the need for speedy reporting.

Back to the librarian. If a library took the same approach as a traditional database, all new books would just be put on any shelf and a member of the public would have to search through each book until they found the one they wanted, much like our alien earlier. Luckily, librarians have spent time coming up with what appears to be a simple means of enabling books to be identified far more rapidly.

Let’s take this at its most basic level - a librarian using a paper-based system. The librarian has a set of cards, each of which covers a certain aspect of indexing books. When a new book comes in, the librarian takes specific information from the book and adds this to each card as needed. For example, the author is added to one card, the subject of the book to another, and the book's physical location to a third. Many of these cards will refer to other cards, so that the librarian can easily move from one item to another as required. When a member of the public comes looking for the book, the librarian can easily retrieve information on it, no matter what information the member of the public provides. If the person wants more books by the same author, they will be listed on the author card; if they want more books on the same subject, the librarian looks them up on the subject card. The librarian hasn’t needed to refer to the overall database of all available books. Instead, they work against very small, dedicated subsets of information.
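The card system described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name CardCatalog, its method names and the sample book details are hypothetical, and each "card" is modeled as a simple in-memory dictionary rather than anything a real library or database product uses.

```python
from collections import defaultdict


class CardCatalog:
    """Hypothetical paper-card catalog: one small lookup per kind of card."""

    def __init__(self):
        self.by_author = defaultdict(list)   # author card  -> titles
        self.by_subject = defaultdict(list)  # subject card -> titles
        self.shelf_of = {}                   # title        -> physical location

    def add_book(self, title, author, subject, shelf):
        # The book's details are copied onto each relevant card on arrival.
        self.by_author[author].append(title)
        self.by_subject[subject].append(title)
        self.shelf_of[title] = shelf

    def books_by_author(self, author):
        # One card lookup; no walk through every book in the library.
        return list(self.by_author.get(author, []))

    def books_on_subject(self, subject):
        return list(self.by_subject.get(subject, []))

    def locate(self, title):
        # Returns None when the title is not in the catalog.
        return self.shelf_of.get(title)
```

A request such as "more books on the same subject" then touches only the subject card, however many books the library holds.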

This is the basis behind the use of standard indices, but one key fact seems to be missed by many database vendors: each new piece of data is dealt with as it enters the system, not once it is already there. The incoming book details are dealt with immediately and each new book is just that - one single item, rather than an increment in the overall massive database. For many organizations, a 10 million record database will have only a few hundred main equivalents of the librarian’s cards, which means that in-line, real-time reporting becomes possible. Each new piece of incoming data can be dealt with very rapidly, with the pertinent information being added to the relevant cards before the main record is created in the master database.
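As a rough sketch of this in-line idea, the Python below updates a couple of small summary "cards" as each record arrives, before writing the record to a stand-in master store. Everything here is an assumption made for illustration: the class IngestIndexedStore, the record fields region, product and amount, and the use of a plain list in place of a real master database.

```python
from collections import defaultdict


class IngestIndexedStore:
    """Hypothetical store that updates summary cards as each record arrives."""

    def __init__(self):
        self.master = []                             # stand-in for the master database
        self.total_by_region = defaultdict(float)    # card: running sales per region
        self.count_by_product = defaultdict(int)     # card: running orders per product

    def ingest(self, record):
        # Update the small cards first, then create the main record.
        self.total_by_region[record["region"]] += record["amount"]
        self.count_by_product[record["product"]] += 1
        self.master.append(record)

    def report_total_by_region(self):
        # The report reads the card only; cost does not grow with the master store.
        return dict(self.total_by_region)

    def report_by_full_scan(self):
        # The approach criticized in the article: re-read every record each time.
        totals = defaultdict(float)
        for record in self.master:
            totals[record["region"]] += record["amount"]
        return dict(totals)
```

Both report methods give the same totals; the difference is that the first reads a card with one entry per region, while the second rereads the whole store on every run.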

So, let’s look at how this would work in the online world with the example of a customer on a Web site creating an order for an item. What information do we already have on them? If it’s held in a standard database, we run a standard search against the existing information, generally using an indexed field, and pull up their existing record. However, if the customer provides information that is held in a non-indexed field, the search becomes massively slower. If we use this in-line mode instead, all we have to do is go to the equivalent of the customer library card and see if they are already there. If not, we add them as a new record. If they are, we see which other cards this card points to for other information, such as previous purchases, payment records, possible upsell items and so on. What provides the main speed here is that these individual virtual cards are permanent records, but are smaller than the underlying database itself by several orders of magnitude. Therefore, searching through these is almost instantaneous and incoming queries can be dealt with immediately and effectively.
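The customer lookup just described might look something like the following Python sketch. The names CustomerCards, find_or_add and record_purchase are hypothetical, as is the choice of email and phone number as the details a customer might supply; the "cards" are ordinary dictionaries, and a purchase list stands in for the other cards a customer card points to.

```python
class CustomerCards:
    """Hypothetical customer cards keyed by whatever detail the customer supplies."""

    def __init__(self):
        self.id_by_email = {}   # card: email -> customer id
        self.id_by_phone = {}   # card: phone -> customer id
        self.purchases = {}     # card: customer id -> previous purchases
        self._next_id = 1

    def find_or_add(self, email=None, phone=None):
        # Check each card for the detail provided; no scan of the master database.
        cid = None
        if email is not None:
            cid = self.id_by_email.get(email)
        if cid is None and phone is not None:
            cid = self.id_by_phone.get(phone)
        created = cid is None
        if created:
            # Not on any card: add the customer as a new record.
            cid = self._next_id
            self._next_id += 1
            self.purchases[cid] = []
        # Note any newly supplied detail on its card for future lookups.
        if email is not None:
            self.id_by_email.setdefault(email, cid)
        if phone is not None:
            self.id_by_phone.setdefault(phone, cid)
        return cid, created

    def record_purchase(self, cid, item):
        self.purchases[cid].append(item)

    def previous_purchases(self, cid):
        # Follow the pointer from the customer card to the purchases card.
        return list(self.purchases[cid])
```

Because every detail the customer might provide has its own card, a lookup by phone number costs the same as a lookup by email, which is the point the paragraph makes about non-indexed fields.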
