Editor’s Note: DMReview.com would like to welcome Faisal Shah to its list of online columns. Shah is co-founder and chief technology officer of Knightsbridge Solutions LLC, and his monthly column, "Managing Big Data," will focus on solutions in technology infrastructure, data architecture, ETL and applications. Shah will also co-author several columns with colleagues from Knightsbridge. "Managing Big Data" will be updated on the last Friday of every month.
It’s a data-driven world we live in, and data volume and complexity are escalating quickly. Companies large and small generate new data constantly: they create data when their customers perform transactions, they buy data from other companies, and they multiply their detailed data by replicating it, aggregating it and so on. The result is a great deal of data dispersed across enterprise environments.
This column is aimed at the IT professional who deals with high data volumes, and it will focus on technology infrastructure, data architecture, ETL architecture and applications in high data volume settings. I plan to rotate the topics regularly across these four subject areas.
What do I mean by high data volume? It is a relative term, because one company’s high volume is another’s pocket change. I have personally been involved with companies that have 100+ terabytes of storage in their IT departments. At the other end of the scale, if all the data associated with an application fits in main memory, we are not talking about high data volume. Although high data volume is a relative term, you’ll know it if you have big data.
Given the four subject areas just mentioned, let’s take a quick peek into sample challenges often faced when working with large data volumes.
Technology Infrastructure

Let’s jump right in to a good, juicy, often-controversial subject: What hardware platform should house a large data application?
I have been involved with many IT departments in the midst of making this decision and, without exception, there has always been some sort of holy war between at least two parties involved in the decision. Sometimes it’s UNIX versus mainframe, other times it’s Wintel versus UNIX, and there are architecture battles around UNIX versus UNIX, MPP versus SMP or NUMA versus SMP. Often, the new guard tries to prove to the old guard that the new thing is better than the old thing. Of course, the new guard has nothing to go by but vendor promises or cool architecture diagrams (it looks great on paper, 1,000 kajillion hertz bus speed way cool!) or some recently published benchmark figures. Conversely, the old guard tries to prove how flexible that grand old technology is. If it can handle 50,000 online transactions per second, how can it not be the right platform for that 10-terabyte data warehouse? (Isn’t that like trying to win the Indy 500 by driving a jet fighter around the track?)
In future columns, I hope to help you work through debates such as these by providing some very pragmatic and usable guidelines. I’m personally very interested in the usable scale range for particular platforms. I plan to share that information based on how I see some of my customers deploy large data applications on these platforms.
Data Architecture

There is plenty of good debate fodder here. How much preaggregation should we do before loading our repository? If the database imposes such a high storage-to-data ratio, should we even be using a DBMS? How do we let our online transactional applications query our data warehouse without bringing the warehouse to its knees and without affecting the 24x7 availability of the online application?
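To make the preaggregation question concrete, here is a minimal sketch (with made-up sample data; the record layout is purely illustrative) of rolling detail transactions up to per-store daily totals before load:

```python
# Illustrative only: pre-aggregating detail rows to (store, day) totals
# before loading the warehouse, so far fewer rows are stored.
from collections import defaultdict
from datetime import date

# Hypothetical detail records: (store_id, transaction_date, amount)
detail = [
    ("S1", date(2001, 3, 1), 19.99),
    ("S1", date(2001, 3, 1), 5.00),
    ("S2", date(2001, 3, 1), 42.50),
    ("S1", date(2001, 3, 2), 7.25),
]

def preaggregate(rows):
    """Roll detail rows up to (store, day) -> (count, total) prior to load."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for store, day, amount in rows:
        bucket = totals[(store, day)]
        bucket[0] += 1
        bucket[1] += amount
    return {key: tuple(val) for key, val in totals.items()}

summary = preaggregate(detail)
# Four detail rows collapse to three summary rows; at warehouse scale the
# reduction can be orders of magnitude -- but every rollup discards detail
# that some future query may need, which is exactly the debate.
```

The trade-off the sketch illustrates: the smaller the loaded rowset, the cheaper the warehouse, but the aggregation level is a one-way door unless the detail is retained somewhere.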
Consider also that the reason data warehouses exist is to separate the analytic applications from the transactional ones. With the advent of real-time personalization and other similar applications, we now have to tightly couple these two types of applications. Does the transactional application query the data warehouse, or does the analytical application query the online store? Is an altogether new repository required at this intersection point? Further still, at this intersection point, how does one reconcile the issue of divergent availability characteristics for transactional and analytical applications? Of course, this problem is only interesting when the data volume is large enough to force decoupling the transactional and analytical applications. It’s not too hard to support both problem domains when all the data fits in memory.
Extraction, Transform and Load (ETL)
I use a broad definition for ETL. I define it as any data handling and processing that precedes final storage in the repository. I plan to tackle issues such as build versus buy, performing transformations and aggregating outside the database or inside the database, and managing meta data for heterogeneous data stores.
On this last point, I’ve found that many ETL tools deal with meta data and deal with data, but hardly any deal with the place where the two intersect. For example, given some meta data, I’d like to know all the data instantiations of that meta data; or, given some data, I’d like to know what meta data describes it. Over the long haul, this turns out to be more than a subtle point. Consider the challenges as meta data changes and different generations of an object are left lying around. When we backed up that customer file in 1996, what meta data did that file conform to? Even if we know what meta data describes this archived file, do we know how to transform that file to conform to today’s meta data? How much work will it be to perform this transformation? This problem does not result purely from large data volume, but it does seem to be exacerbated by the sheer number of applications and the rate at which data objects morph in large organizations, which, incidentally, often deal with large data volume.
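One way to keep the data/meta-data intersection answerable is to record, with every data object, the meta data generation it was written under. A minimal sketch, assuming a hypothetical registry (the names SchemaRegistry and SchemaVersion are mine, not any tool's API):

```python
# Illustrative sketch: each data object carries the version number of the
# meta data (schema) generation it conforms to, so old archives remain
# interpretable. Not based on any particular ETL product.
from dataclasses import dataclass, field

@dataclass
class SchemaVersion:
    version: int
    columns: tuple  # ordered column names in effect for this generation

@dataclass
class SchemaRegistry:
    history: list = field(default_factory=list)

    def publish(self, columns):
        """Register a new meta data generation; return its version number."""
        v = SchemaVersion(len(self.history) + 1, tuple(columns))
        self.history.append(v)
        return v.version

    def describe(self, version):
        """Given a data object's recorded version, return its meta data."""
        return self.history[version - 1]

registry = SchemaRegistry()
v1996 = registry.publish(["cust_id", "name", "zip"])            # old layout
v_now = registry.publish(["cust_id", "name", "zip5", "zip4"])   # today's layout

# Each archived file records the version it was written under, so
# "what meta data did that 1996 backup conform to?" has an answer.
archive = {"customer_1996.bak": v1996}
old_schema = registry.describe(archive["customer_1996.bak"])
```

The registry answers the "which meta data describes this file?" question; the harder question, how to transform the 1996 layout into today's, still requires a mapping between generations that someone must author.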
Applications

Large data volume applications are often not off-the-shelf solutions. Enterprises must build these solutions from scratch or integrate several lower-level applications. In future columns, we will present some of these novel applications and candidate solutions.
For example, an Internet application service provider has convinced a few million users to install an ad viewer on their workstations. The service provider analyzes what kinds of pages each user browses and attempts to present ads that might be of interest. The browsing activity generates history that, if properly organized and stored, can effectively describe the user’s interests and behavior. This information can be used to choose the types of promotions to which the user is most likely to respond. The technical challenges in this type of application include the very high volume (millions of users multiplied by hundreds of page views per day) and the rapidity with which decisions must be made: the decision as to which ads to show may have to be made in a second or less. Applications such as this often pose a challenge because the cost of computing the result for a user is higher than can fit within a subsecond transaction. Precomputing these results is often a candidate solution but, out of a million-user customer population, the percentage of users that will actually be served tomorrow is probably small (one to five percent). Should we really precompute the results for all users when we need this information for only a small minority of the population?
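One alternative to precomputing for everyone is to compute on first request and cache the result, so only the small active minority ever pays the cost. A toy sketch, with an invented overlap score standing in for the (expensive) real model:

```python
# Illustrative sketch: score ads on demand and memoize per user, rather
# than precomputing for all users. score_ads() is a stand-in for an
# expensive computation; the data and scoring rule are made up.
from functools import lru_cache

USER_INTERESTS = {
    "u1": {"sports", "travel"},
    "u2": {"finance"},
}
ADS = {"a1": {"sports", "travel"}, "a2": {"finance"}}

CALLS = {"count": 0}  # track how often the expensive path actually runs

@lru_cache(maxsize=100_000)
def score_ads(user_id):
    """Pick the ad whose tags best overlap the user's interests.
    Expensive in reality; memoized so repeat visits pay nothing."""
    CALLS["count"] += 1
    interests = USER_INTERESTS.get(user_id, set())
    return max(ADS, key=lambda ad: len(ADS[ad] & interests))

first = score_ads("u1")   # computed on first visit
second = score_ads("u1")  # served from cache; no recomputation
```

Under this scheme, the one-to-five-percent of users who show up tomorrow trigger one computation each; the other ninety-five percent cost nothing. The open question the column raises remains: whether even that first on-demand computation can fit inside the subsecond budget.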
In future columns, I will attempt to deal with these and similar topics. Given the limited space, I don’t expect to provide a complete treatise on any of these subjects, but I hope to get straight down to brass tacks. I’ll attempt to expose the challenging points and how one might successfully handle these issues. I welcome feedback from readers, both about the column and about preferences for future topics.