JAN 12, 2011 6:25am ET

The Case for Push-Down Analytics

Every organization is trying to turn its data on customer behavior into a competitive advantage. Marketing teams in every enterprise work hard to collect details about their customers, capture their tastes and preferences, and retain existing customers by giving them the best personal experience.

For many enterprises, the ability to collect data is no longer a problem; in reality, it has resulted in an enormous amount of transactional data about their customers. The real challenge is to build analytics on top of that data and have them ready to work the moment the customer is ready.

In most organizations, it takes a fairly long time to build analytics out of transactional data and convert those insights into tangible marketing action. The primary reason for this delay is that analytics is traditionally kept separate from the database layer. This can result in data replication, and with big data, agility is compromised to an extent. Analytics is typically externalized from the DB layer possibly because of the inadequacy and non-procedural nature of SQL, the language that has evolved as the standard for manipulating structured data. SQL in its native form is not meant for analytics; it is intended for data storage, retrieval and the creation of simple summaries.

With enterprises becoming more aggressive in reducing their go-to-market time, alternatives to traditional relational database management system products are emerging that focus on performance, efficient data storage and in-database analytical capabilities. New players in the DBMS market are trying to differentiate their solutions with these capabilities.

Look for five key technical characteristics (which may appear individually or in combination) in products and solutions designed to provide in-database analytics functionality.

  1. Parallel Shared-Nothing Architecture: Building analytics quickly on big data is only possible when multiple units of work execute as independent parallel processes. Performance is enhanced when the individual processes do not share memory or disk space (shared nothing), which avoids locks on resources. An intelligent data partitioning strategy, along with a pipelined data flow framework on top of the shared-nothing infrastructure, will also help boost performance. A shared-nothing architecture is highly scalable, and almost every data warehouse DBMS product on the market (Teradata, Netezza, Greenplum, etc.) is based on this architecture.
  2. Programming Framework for Creating Customized Summaries: Analytics is all about intelligent summaries that identify patterns in large volumes of transactional data, and these go well beyond the summarization capabilities offered by SQL. An analytical engine needs a parallelizable recursive programming framework into which the analyst can plug his or her customized logic (with user-defined functions, if needed) to create summaries. This framework is important to any organization that tries to instill a data-driven culture. The MapReduce programming model, developed by Google in 2004, provides this framework. Many DW DBMS vendors (Greenplum and Aster Data, for example) have already embraced this framework and provide some form of support for it as part of their software. Vendors such as Vertica have incorporated Cloudera’s DBInputFormat interface, which enables developers using Hadoop (an open source implementation of MapReduce) to push map and reduce operations.
  3. Analytical DB Engine: Speed is achieved by moving data processing software components closer to the data storage hardware components. Solutions that try to provide responsive analytics will look at pushing their analytical capability closer to the DB layer and/or pushing DB processing to the data storage layer (which I describe further in the following hardware acceleration section). While some flexibility could be lost with this approach, performance will be better because little data is replicated during processing. Because analytics need not always be built from structured data, the analytical DB engine should be generic enough to also handle or query unstructured data (such as logs, click streams, etc.). DB vendors may offer this analytical capability as part of the DB engine (e.g., Aster Data’s relaxed SQL format SQL-MR) or accomplish it in collaboration with analytics vendors (e.g., SAS), with each providing native support for the other.
  4. Hardware Acceleration: This is an extension of the analytical DB engine concept in that DB processing is pushed to the data storage. For example, Netezza accomplishes this through field-programmable gate array (FPGA) chips, wherein queries are executed close to the storage and records are eliminated as early as possible. This means that the “select” and “where” clauses are executed as the data is streamed out of the storage.
  5. Smart Data Storage and Retrieval: Query performance improves when there is an infrastructure to avoid full table scans and retrieve data quickly. Traditional RDBMSs have tackled this problem through indexes, but in-database analytics needs something more than indexes to ensure that query performance is really fast. Extremely fast query performance can be achieved by adopting any (or all) of the following three approaches that move away from the conventional row-based storage found in many databases.
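
As a brief aside before those storage approaches, the first two characteristics in the list above (shared-nothing partitions and a MapReduce-style summary framework) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the modulo partitioning rule and the sample records are hypothetical, and the "nodes" are plain lists processed one after another rather than real parallel processes.

```python
from collections import defaultdict
from functools import reduce


def partition(records, n_nodes):
    """Assign each (customer_id, amount) record to exactly one node.

    Partitioning by customer id means no two nodes hold data for the
    same customer, so nodes never need to lock shared state.
    """
    nodes = [[] for _ in range(n_nodes)]
    for customer_id, amount in records:
        nodes[customer_id % n_nodes].append((customer_id, amount))
    return nodes


def map_partition(rows):
    """Map step: runs on one node and sees only that node's local rows."""
    totals = defaultdict(float)
    for customer_id, amount in rows:
        totals[customer_id] += amount
    return dict(totals)


def merge(left, right):
    """Reduce step: combine two partial summaries into one."""
    merged = dict(left)
    for customer_id, amount in right.items():
        merged[customer_id] = merged.get(customer_id, 0.0) + amount
    return merged


def total_spend(records, n_nodes=3):
    # In a real system each map_partition call would run in parallel on
    # its own node; here they run sequentially for simplicity.
    partials = [map_partition(rows) for rows in partition(records, n_nodes)]
    return reduce(merge, partials, {})


# Hypothetical sample data: (customer_id, purchase amount)
records = [(1, 20.0), (2, 5.0), (1, 7.5), (3, 12.0), (2, 1.0)]
print(total_spend(records))
```

The analyst's customized logic would live in `map_partition` and `merge`; the partitioning and the fan-out across nodes are what the framework provides.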

 
Columnar DB: Analytics is all about finding patterns in data, which boils down to analyzing specific columns across a large number of records. Because reads from a storage device are typically done in blocks, query performance can be improved immensely by storing the values of each column together (instead of storing whole rows together). Fewer blocks of data then need to be read, which yields faster responses.
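
The block-count argument can be made concrete with a toy model. The following Python sketch assumes a hypothetical block size of four values and a made-up four-column table; it counts how many blocks must be read to sum one column under each layout. A real columnar engine persists the column layout on disk rather than building it at query time as this sketch does.

```python
BLOCK_SIZE = 4  # values per storage block (illustrative assumption)


def to_blocks(values):
    """Split a flat sequence of values into fixed-size blocks."""
    return [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]


def sum_from_row_store(rows, col):
    """Sum one column when whole rows are stored contiguously.

    Every block has to be read, because the wanted column's values are
    interleaved with all the other columns.
    """
    n_cols = len(rows[0])
    blocks = to_blocks([value for row in rows for value in row])
    total = 0.0
    for block_no, block in enumerate(blocks):
        for offset, value in enumerate(block):
            if (block_no * BLOCK_SIZE + offset) % n_cols == col:
                total += value
    return total, len(blocks)


def sum_from_column_store(rows, col):
    """Sum one column when each column's values are stored together."""
    blocks = to_blocks([row[col] for row in rows])
    total = sum(value for block in blocks for value in block)
    return total, len(blocks)


# Hypothetical table: (id, name, amount, country)
rows = [(i, "cust%d" % i, i * 1.5, "US") for i in range(8)]

row_total, row_blocks = sum_from_row_store(rows, col=2)
col_total, col_blocks = sum_from_column_store(rows, col=2)
print(row_total, row_blocks)
print(col_total, col_blocks)
```

Both functions return the same total, but in this toy setup the row layout touches every block of the table while the column layout touches only the blocks that hold the one column being summed.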

Correlation database: A correlation database stores each distinct data value only once and ensures that every record containing that value references the single stored copy. Data redundancy is very low in this approach, and data quality is very high because the database avoids storing the same information more than once. Storing data only once and building an inverted-index-like structure over it provides faster responses for ad hoc queries on the database. Illuminate is an example of a vendor that has commercialized a CDBMS.
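
The store-once idea with an inverted index can be sketched as follows. This is a toy Python illustration of the general principle only, not a description of how any vendor's product is implemented; the class name `ValueStore` and its methods are hypothetical.

```python
class ValueStore:
    """Toy value-based store: each distinct value is kept exactly once."""

    def __init__(self):
        self.value_ids = {}  # value -> value id
        self.values = []     # value id -> value (the single stored copy)
        self.records = []    # record number -> {field name: value id}
        self.index = {}      # value id -> set of record numbers (inverted index)

    def insert(self, record):
        """Store a record as references to shared values; return its number."""
        rec_no = len(self.records)
        encoded = {}
        for field, value in record.items():
            value_id = self.value_ids.get(value)
            if value_id is None:
                # First time this value is seen: store it once.
                value_id = len(self.values)
                self.value_ids[value] = value_id
                self.values.append(value)
            encoded[field] = value_id
            self.index.setdefault(value_id, set()).add(rec_no)
        self.records.append(encoded)
        return rec_no

    def find(self, value):
        """Ad hoc lookup: record numbers that contain value in any field."""
        value_id = self.value_ids.get(value)
        if value_id is None:
            return []
        return sorted(self.index[value_id])

    def fetch(self, rec_no):
        """Rebuild the original record from its value references."""
        return {f: self.values[v] for f, v in self.records[rec_no].items()}


store = ValueStore()
store.insert({"name": "Ann", "city": "Boston"})
store.insert({"name": "Bob", "city": "Boston"})
store.insert({"name": "Boston", "city": "Austin"})
print(store.find("Boston"))
print(len(store.values))
```

Because the index is keyed by value rather than by column, a lookup such as `find("Boston")` does not need to know in advance which field the value appears in, which is what makes this kind of structure convenient for ad hoc queries.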