The Case for Push-Down Analytics

  • January 12 2011, 6:25am EST

Every organization wants to turn its data on customer behavior into a competitive advantage. Marketing teams in every enterprise work hard to collect details about their customers, capture their tastes and preferences, and retain existing customers by providing them with the best personal experience.

For many enterprises, collecting data is no longer the problem; in fact, it has produced enormous amounts of transactional data about their customers. The real challenge is to build analytics on top of that data and have the insights ready the moment the customer is.

It typically takes a fairly long time to build analytics out of transactional data and convert those insights into tangible marketing action in any organization. The primary reason for this delay is that analytics is traditionally kept separate from the database layer. This separation can result in data replication, and with big data, agility is compromised to an extent. Analytics is typically externalized from the DB layer largely because of the inadequacy and non-procedural nature of SQL, the language that has evolved as the standard for manipulating structured data. SQL in its native form is not meant for analytics; it is intended for data storage, retrieval and the creation of simple summaries.

With enterprises becoming more aggressive in reducing their go-to-market time, alternate solutions to relational database management system products and solutions are emerging that focus on performance, efficient data storage and in-database analytical capabilities. New players in the DBMS market are trying to differentiate their solutions with these capabilities.

Look for five key technical characteristics (demonstrated individually or in combination) in products and solutions designed to provide in-database analytics functionality.

  1. Parallel Shared-Nothing Architecture: Building analytics quickly on big data is only possible with multiple units of work executing as independent parallel processes. Performance improves further when each process has its own private memory and disk space (shared nothing), avoiding locks on shared resources. An intelligent data-partitioning strategy, along with a pipelined data flow framework on top of the shared-nothing infrastructure, also boosts performance. A shared-nothing architecture is highly scalable, and almost every data warehouse DBMS product on the market (Teradata, Netezza, Greenplum, etc.) is based on it.
  2. Programming Framework for Creating Customized Summaries: Analytics is all about intelligent summaries that identify patterns in large transactional data sets – far beyond the summarization capabilities offered by SQL. An analytical engine needs a parallelizable, recursive programming framework into which the analyst can plug customized logic (with user-defined functions, if needed) to create summaries. This framework is important to any organization trying to instill a data-driven culture. The MapReduce programming model, published by Google in 2004, provides such a framework. Many DW DBMS vendors (Greenplum and Aster Data, for example) have already embraced it and provide some form of support for it in their software. Vendors such as Vertica have incorporated Cloudera’s DBInputFormat interface, which enables Hadoop (the open source implementation of MapReduce) developers to push map and reduce operations to the database.
  3. Analytical DB Engine: Speed is achieved by moving data-processing software components closer to the data storage hardware. Solutions that aim to provide responsive analytics push their analytical capability closer to the DB layer and/or push DB processing down to the data storage layer (described further in the hardware acceleration section below). Some flexibility may be lost with this approach, but there is no doubt performance will be better, since data replication during processing is low. Because analytics need not always be built from structured data, the analytical DB engine should be generic enough to also handle or query unstructured data (such as logs, clickstreams, etc.). DB vendors may offer this analytical capability as part of the DB engine (e.g., Aster Data’s SQL-MR, a relaxed form of SQL) or accomplish it in collaboration with analytics vendors (e.g., SAS), with each providing native support to the other.
  4. Hardware Acceleration: This is an extension of the analytical DB engine concept: DB processing is pushed into the data storage itself. For example, Netezza accomplishes this through field-programmable gate array chips, wherein queries are executed close to the storage and records are eliminated as early as possible. This means that the “select” and “where” clauses are executed as the data streams out of storage.
  5. Smart Data Storage and Retrieval: Query performance improves when the infrastructure can avoid full table scans and retrieve data quickly. Traditional RDBMSs have tackled this problem with indexes, but in-database analytics needs more than indexes to keep query performance fast. Extremely fast query performance can be achieved by adopting one or more of the following three approaches, each of which moves away from the conventional row-based storage found in most databases.

Columnar DB: Analytics is all about finding patterns in data, which boils down to analyzing specific columns across a large number of records. Because reads from a storage device are typically done in blocks, query performance can be improved immensely by storing column data together (instead of in rows). Fewer blocks of data need to be read, so responses come back faster.
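The read-savings argument can be seen in miniature with plain Python lists. This is an illustrative sketch only (the table and column names are made up), but it shows why an aggregate over one column touches far less data in a columnar layout:

```python
# Toy contrast between row-oriented and column-oriented layouts.
# Aggregating one column in the row layout drags every field of every
# record through the scan; the columnar layout reads only the one
# contiguous column it needs.

rows = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 80.0},
    {"id": 3, "region": "east", "amount": 45.0},
]

# Columnar layout: one list per column, values stored contiguously.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 80.0, 45.0],
}

# Row scan: visits every record object to pull out one field.
total_row = sum(r["amount"] for r in rows)

# Columnar scan: reads only the 'amount' column.
total_col = sum(columns["amount"])

assert total_row == total_col == 245.0
```

On disk the effect is the same but larger: a real columnar engine reads only the blocks holding the `amount` column instead of every block of the table.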

Correlation database: A correlation database stores each distinct data value only once and has all corresponding records reference that single copy. Data redundancy is very low in this approach, and data quality is high because the database avoids duplicate storage of the same information. Storing each value once and building an inverted-index-like structure on top of it provides fast responses for ad hoc queries. Illuminate is an example of a vendor that has commercialized a CDBMS.
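A minimal sketch of the store-once idea, assuming nothing about any vendor's actual internals: each distinct value gets one id, records hold only the reference, and an inverted index maps values back to record ids so an ad hoc lookup is a dictionary access rather than a scan. All names and data here are illustrative:

```python
# Store-once value dictionary plus inverted index (CDBMS-style sketch).
records = [
    (1, "chicago"), (2, "boston"), (3, "chicago"), (4, "chicago"),
]

value_ids = {}   # value -> unique value id (each value stored once)
inverted = {}    # value id -> set of record ids referencing it
encoded = {}     # record id -> value id (records hold references only)

for rec_id, city in records:
    vid = value_ids.setdefault(city, len(value_ids))
    inverted.setdefault(vid, set()).add(rec_id)
    encoded[rec_id] = vid

# Ad hoc query: which records mention "chicago"? One dictionary
# lookup into the inverted index instead of a scan over all records.
hits = sorted(inverted[value_ids["chicago"]])
print(hits)  # [1, 3, 4]
```

Note how "chicago" appears once in `value_ids` no matter how many records reference it; that is the low-redundancy property the paragraph describes.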

Data compression: Compressing data on storage means less data must be read back during reads. While this adds the overhead of decompression, some DB vendors achieve better overall performance by compressing stored data and decompressing the retrieved data in memory. Infobright and SAND are among the vendors that provide this functionality.
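The trade-off is easy to demonstrate with the standard library. This sketch uses `zlib` as a stand-in for whatever codec a real engine would use, and synthetic rows invented for the example; the point is simply that the compressed form (what would sit on disk) is smaller than the raw form, and decompression happens entirely in memory:

```python
import json
import zlib

# Synthetic, repetitive rows -- the kind of data that compresses well.
rows = [{"customer": f"c{i % 50}", "amount": i * 0.5} for i in range(1000)]
raw = json.dumps(rows).encode("utf-8")

# What would be written to storage: the compressed bytes.
stored = zlib.compress(raw, level=6)
assert len(stored) < len(raw)  # fewer bytes to read back from disk

# On read: decompress in memory, then deserialize.
restored = json.loads(zlib.decompress(stored))
assert restored == rows
```

The I/O saved by reading `len(stored)` bytes instead of `len(raw)` bytes is what pays for the CPU spent in `decompress`.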

These are only some of the approaches available; others exist. Some database solutions add even more speed to data retrieval by classifying data as hot (frequently accessed) or cold (not-so-frequently accessed), storing hot data on smaller disks and cold data on larger disks. Another route to fast retrieval is broadcasting smaller tables to all the partitions, ensuring faster joins across partitioned data.
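The broadcast-join idea can be sketched in a few lines. This is an illustrative toy, not any product's implementation: the small dimension table is copied to every partition of the large fact table, so each partition joins locally without moving fact rows around (the table names and data are invented for the example):

```python
# Broadcast join sketch: replicate the small table to every partition
# of the large, hash-partitioned table; each partition joins locally.

dim = {"east": "East Region", "west": "West Region"}  # small table

fact_partitions = [            # large table, already partitioned
    [("east", 120.0), ("west", 80.0)],
    [("east", 45.0)],
]

def local_join(partition, broadcast_dim):
    # The partition joins against its own full copy of the small table,
    # so no fact rows need to cross partition boundaries.
    return [(broadcast_dim[region], amount) for region, amount in partition]

joined = [row
          for part in fact_partitions
          for row in local_join(part, dict(dim))]  # dict() = local copy
print(joined)
```

Shipping the few dimension rows to every partition is far cheaper than reshuffling the (much larger) fact table to co-locate matching keys.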
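Pulling the earlier threads together, the shared-nothing partitioning and MapReduce-style summarization of characteristics 1 and 2 can also be sketched in miniature. The data and function names below are invented for illustration, and worker threads stand in for the per-machine processes a real engine would use; the shape of the computation (partition, map, reduce) is the point:

```python
# Shared-nothing, MapReduce-style summarization sketch: hash-partition
# on the customer key, let each worker summarize only its own
# partition (map), then merge the partial summaries (reduce).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_partition(partition):
    """Map: partial spend-by-customer for one partition."""
    totals = Counter()
    for customer_id, amount in partition:
        totals[customer_id] += amount
    return totals

def reduce_partials(partials):
    """Reduce: merge the per-partition partial summaries."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

def spend_by_customer(transactions, n_partitions=4):
    # Hash-partition on the key so all of a customer's rows land in
    # exactly one partition and no worker needs another's data.
    partitions = [[] for _ in range(n_partitions)]
    for row in transactions:
        partitions[hash(row[0]) % n_partitions].append(row)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_partials(partials)

txns = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("carol", 3.0)]
summary = spend_by_customer(txns)
print(summary["alice"])  # 17.5
```

Because the partitioning routes each customer's rows to a single worker, the map phase needs no coordination at all, which is exactly what makes the shared-nothing design scale by adding nodes.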
