Free Site Registration

The Evolution of Analytic Platforms

Information Management Magazine, May 2009

William McKnight

Companies of all sizes everywhere are engaging in business intelligence. When done properly, nothing adds value to business in today's economy more than BI. It's the face of the modern competitive landscape. However, in laying down the technical foundation to support the underlying data, these same companies are faced with an exponentially higher number of variations and distinct departures from the traditional online transactional processing database management systems than ever before. These options have been created in recent years largely due to the high cost of ownership of the traditional approaches, as well as the hit-or-miss nature of companies in achieving acceptable performance in their analytics.

Analytic queries lie at the heart of BI. However, analytic query and daily reporting have fundamental differences. With analytic query, the results are desired immediately. Each query can also be a spur-of-the-moment, fleeting thought - the first in a long series of thoughts that lead to business change. These thoughts can only be capitalized on if query performance is not just good, but great. Large volumes exacerbate the performance difference between acceptable and unacceptable analytic query performance.

Analytic query performance continues to be difficult to achieve in the traditional architectures. Any organization that has lived with a data warehouse or other large data store in production for a while knows that manageability for change is just as essential. This includes the IT department who executes the changes, as well as the users in the business who deal with service levels provided by IT.

Ironically, what usually leads to difficult-to-manage databases is trying to achieve acceptable query performance. Manageability and performance go hand in hand. Under time pressures to deliver reports, well-crafted data models that facilitate a solid understanding of the database to the user and deliver performance to the queries are increasingly harder to find. Although it lacks standards, the quick-turn data model, often stays around for years. Once the query environment adapts to a model, regardless of the difficulty in doing so, the model sticks and a patchwork of performance techniques is embedded into the culture. 

Today, typical data management environments have several restrictions they’ve "learned to live with” in seeking acceptable analytical performance. Indexes, summary tables, cubes, utility executions like reorganizations and various denormalizations are often introduced into the environment. There's a point at which overcoming these challenges ceases to be easy and affordable and therefore ceases to get done. But perhaps more important than these very real restrictions are the limitations in thinking about possibilities and information exploitation.

Well-crafted data models as well as the performance tuning can be acceptably accomplished with expert data modelers and database administrators. However, these are highly mobile individuals, and tying environmental performance to their availability and cost is less appealing.

Data Storage Blocks

While results are what matters most when it comes to a database management system, having an understanding of what goes on at the molecular (i.e., bits and bytes) level is helpful in understanding the fundamental differences between traditional and newer approaches. 

Data is ultimately stored as bytes. These bytes are grouped by the several thousand, from 4,000 to 64,000, into “pages” or “blocks” is the unit of input/output – an exception being the “prefetch” capability of row-oriented database management systems sensing a pattern in the reads. In a row-wise oriented DBMS, blocks contain headers and footers comprising metadata that organize the information the block contains. The footer contains pointers to where the records begin in the block. Records can be fixed or variable length. Each column in the record, stored consecutively in a traditional DBMS, is comprised of enough bytes to represent the value. In the case of a variable length column, an extra 2 bytes precede the column which contain its length. In the case of a nullable column, an extra byte precedes the column containing a yes/no condition for the nullability of the column. 

Data type determines the column length. For example, an integer is 4 bytes (and 4 bytes can contain numbers up to +/- 32,000). A character (10) is 10 characters since each character is a byte. And so on. 

Records, which always contain a few, fixed bytes in the header, can be a few bytes up to thousands of bytes, but they cannot span the block. Some data types, like unstructured data, stored in file structures, can be pointed to from columns which are just pointers.

Block headers, on the other hand, contain valuable statistics about the block, such as the number of records contained, the timestamp of the last statistics run and so on. Blocks can contain records from multiple tables, and they may be interwoven, depending on the planned or unplanned co-location strategy presented to the DBMS in the data definition language.

Relational Theory

The relational theory, invented by E. F. Codd in 1969, is one of the most powerful theories in computing, ever. It dictates the way the DBMS processes and the way organizations design the tables that are placed into the data storage blocks. Relational DBMS are an implementation of the relational theory.

The relational theory drove the development of the structured query language used to interface with the data storage blocks of the DBMS. It provides a much-needed level of insulation between the user (business or IT user) and the data storage blocks. It states that it does not functionally matter in what order the columns are within the table or in what order the rows are within the table. Whatever you ask for with SQL, you will get. The relational theory is also the foundation that allows for the join of data from multiple tables into a single result set, something that many queries, analytic and otherwise, need to do.

Mix and Match

Options available today for analytics do not have to be considered “black box.” A little knowledge can help a prospective buyer or someone considering rearchitecture understand how the solutions execute analytics and therefore what they will do best. The key architectural dimensions these days are row-wise versus columnar, the systems architecture, megavendor versus independent, delivery mechanism and packaging level. Options are available at many of the intersections of the possibilities across the dimensions.

Columnar Orientation

A columnar DBMSs is an implementation of the relational theory, but with a twist. The data storage layer, discussed earlier, does not contain records. It contains a grouping of columns. For example, all the column one values are physically together, followed by all the column two values, etc. The data is stored in row order so the 100th entry for column one and the 100th entry for column two belong to the same row.

Columnar DBMS do not have a comparable restriction to the row-wise limit of records per page. There can be thousands of values in a block. Due to the variable column lengths within a row, a small column with low cardinality, or variability of values, may reside completely within one block while another column with high cardinality and longer length may take a thousand blocks.

The columnar block is much larger than row-wise and contains metadata as well, but much of it is dedicated to the compression dictionary, where the compression routines are located, and is ultimately very important for placing and locating column values within the block.

In columnar, all the same data – your data – is there. It’s just organized differently. The main benefits include the ability to highly compress the data. Column orientation greatly assists a compression strategy due to the high potential for the existence of similar values in columns of adjacent rows in the table. The 50 percent average compression in columnar implementations is high by comparison to other DBMSs. Compression packs more data into a block and therefore more data into a read. Since locating the right block to read and reading it are two of the most expensive computer operations, you want to get the most out of it.

A byproduct of the compression is the ability to perform columnar operations like MIN, MAX, SUM, COUNT and AVG very rapidly. All the data needed for such operations are co-located. Likewise, the fewer the columns needed in a query (for both presentation and predicate analysis), the more advantages a columnar approach will to have. Processing that requires dozens of fields will not perform as well.

These are the major differences. The main reason why you would want to utilize a columnar approach is simply to speed up the performance of analytic queries. There is no difference in the SQL or data access tool used to interface with the data or the logical data modeling. 

The columnar approach to data management is not new. In fact, the mainframe database Model 204, first deployed in 1972, uses a similar approach. However, the columnar time may have finally come.

System Architecture

Parallelism is key to query performance success, and massively parallel processing systems provide the most parallelism possible. MPP systems evolved from symmetric multiprocessing systems which, in turn, evolved from uniprocessor systems. 

Uniprocessor systems hit their bottleneck in CPU processing as companies turned up the workloads to enterprise levels. The signal speed was limited by the speed of light, the circuit sizes were limited and higher performing processors cost exponentially more than their performance take-up. The hardware vendor response was parallel processing.

First-generation parallelism was SMP and included functionally specialized, data and workflow parallelism. The added processor(s) alleviated some of the CPU bottleneck, but created another one - in the bus that is used to connect the CPUs to the memory and the I/O unit which accesses the disk. Clusters allowed multiple collections of CPU, memory and bus, called nodes, to work together over an “interconnect.” With every advancement, companies were enjoying greater processing speed and throughput.

Disk may or may not be shared in the cluster architecture. Sharing nothing minimizes disk access bottlenecks and is preferred in MPP, along with an exponential uptake in the number of processors and the sophistication of the interconnect. MPP is generally the preferred architecture for analytic query, BI and data warehousing.

Vendor Longevity 

Advertisement

Page 1 of 2.

Advertisement

Advertisement