Companies of all sizes everywhere are engaging in business intelligence. When done properly, nothing adds value to business in today's economy more than BI. It's the face of the modern competitive landscape. However, in laying down the technical foundation to support the underlying data, these same companies are faced with an exponentially higher number of variations and distinct departures from the traditional online transactional processing database management systems than ever before. These options have been created in recent years largely due to the high cost of ownership of the traditional approaches, as well as the hit-or-miss nature of companies in achieving acceptable performance in their analytics.
Analytic queries lie at the heart of BI. However, analytic query and daily reporting have fundamental differences. With analytic query, the results are desired immediately. Each query can also be a spur-of-the-moment, fleeting thought - the first in a long series of thoughts that lead to business change. These thoughts can only be capitalized on if query performance is not just good, but great. Large volumes exacerbate the performance difference between acceptable and unacceptable analytic query performance.
Analytic query performance continues to be difficult to achieve in the traditional architectures. Any organization that has lived with a data warehouse or other large data store in production for a while knows that manageability for change is just as essential. This includes the IT department who executes the changes, as well as the users in the business who deal with service levels provided by IT.
Ironically, what usually leads to difficult-to-manage databases is the effort to achieve acceptable query performance. Manageability and performance go hand in hand. Under time pressure to deliver reports, well-crafted data models that give users a solid understanding of the database and deliver query performance are increasingly hard to find. Although it lacks standards, the quick-turn data model often stays around for years. Once the query environment adapts to a model, regardless of the difficulty in doing so, the model sticks and a patchwork of performance techniques becomes embedded in the culture.
Today, typical data management environments have several restrictions they've learned to live with in seeking acceptable analytical performance. Indexes, summary tables, cubes, utility executions like reorganizations and various denormalizations are often introduced into the environment. There's a point at which overcoming these challenges ceases to be easy and affordable, and therefore ceases to get done. But perhaps more important than these very real restrictions are the limitations in thinking about possibilities and information exploitation.
Well-crafted data models as well as the performance tuning can be acceptably accomplished with expert data modelers and database administrators. However, these are highly mobile individuals, and tying environmental performance to their availability and cost is less appealing.
Data Storage Blocks
While results are what matters most when it comes to a database management system, having an understanding of what goes on at the molecular (i.e., bits and bytes) level is helpful in understanding the fundamental differences between traditional and newer approaches.
Data is ultimately stored as bytes. These bytes are grouped by the several thousand, typically 4,000 to 64,000, into pages or blocks, and the block is the unit of input/output (an exception being the prefetch capability of row-oriented database management systems when they sense a pattern in the reads). In a row-wise oriented DBMS, blocks contain headers and footers comprising metadata that organizes the information the block contains. The footer contains pointers to where the records begin in the block. Records can be fixed or variable length. Each column in the record, stored consecutively in a traditional DBMS, comprises enough bytes to represent the value. In the case of a variable-length column, an extra 2 bytes precede the column, containing its length. In the case of a nullable column, an extra byte precedes the column, containing a yes/no indicator of whether the value is null.
Data type determines the column length. For example, an integer is 4 bytes (and 4 bytes can contain numbers up to roughly +/- 2.1 billion). A character(10) is 10 bytes, since each character is a byte. And so on.
Records, which always contain a few fixed header bytes, can range from a few bytes up to thousands of bytes, but they cannot span blocks. Some data types, like unstructured data stored in separate file structures, are referenced by columns that hold only pointers.
Block headers, on the other hand, contain valuable statistics about the block, such as the number of records contained, the timestamp of the last statistics run and so on. Blocks can contain records from multiple tables, and they may be interwoven, depending on the planned or unplanned co-location strategy presented to the DBMS in the data definition language.
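To make the record layout above concrete, here is a minimal sketch in Python of how a row-wise engine might pack one record: fixed-length columns as raw bytes, a 2-byte length prefix before a variable-length column, and a 1-byte null flag before a nullable one. The table and column names are hypothetical, and real engines differ in the details.

```python
import struct

def pack_record(cust_id, name, phone):
    """Pack one row as a row-wise DBMS might lay it out on a page.
    Illustrative layout only; actual on-disk formats vary by product."""
    buf = struct.pack("<i", cust_id)             # INTEGER column: 4 bytes
    name_bytes = name.encode("utf-8")
    buf += struct.pack("<H", len(name_bytes))    # 2-byte length prefix...
    buf += name_bytes                            # ...before the VARCHAR payload
    if phone is None:
        buf += b"\x01"                           # null flag set, no payload stored
    else:
        buf += b"\x00"                           # null flag clear...
        buf += phone.encode("utf-8")[:10].ljust(10)  # ...then CHAR(10) payload
    return buf

row = pack_record(42, "Acme", None)
print(len(row))  # 4 + 2 + 4 + 1 = 11 bytes
```

Note how the null column costs only its 1-byte flag when no value is present, while the fixed-length integer always occupies its full 4 bytes.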
The relational theory, introduced by E. F. Codd in 1969, is one of the most powerful theories in computing, ever. It dictates the way the DBMS processes data and the way organizations design the tables that are placed into the data storage blocks. Relational DBMSs are an implementation of the relational theory.
The relational theory drove the development of the Structured Query Language (SQL) used to interface with the data storage blocks of the DBMS. It provides a much-needed level of insulation between the user (business or IT user) and the data storage blocks. It states that it does not functionally matter in what order the columns are within the table or in what order the rows are within the table. Whatever you ask for with SQL, you will get. The relational theory is also the foundation that allows for the join of data from multiple tables into a single result set, something that many queries, analytic and otherwise, need to do.
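A small example illustrates both points: SQL names the columns it wants regardless of physical order, and a join assembles rows from two tables into one result set. This sketch uses Python's built-in SQLite; the tables and data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (cust_id INTEGER, amount REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# The query names what it wants; physical row and column order never
# enter into it, and the DBMS joins the two tables into one result set.
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customer c JOIN orders o ON o.cust_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 75.0)]
```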
Mix and Match
Options available today for analytics do not have to be considered black box. A little knowledge can help a prospective buyer or someone considering rearchitecture understand how the solutions execute analytics and therefore what they will do best. The key architectural dimensions these days are row-wise versus columnar, the systems architecture, megavendor versus independent, delivery mechanism and packaging level. Options are available at many of the intersections of the possibilities across the dimensions.
A columnar DBMS is an implementation of the relational theory, but with a twist. The data storage layer, discussed earlier, does not contain records. It contains a grouping of columns. For example, all the column one values are physically together, followed by all the column two values, etc. The data is stored in row order, so the 100th entry for column one and the 100th entry for column two belong to the same row.
Columnar DBMSs do not have a restriction comparable to the row-wise limit of records per page. There can be thousands of values in a block. Because column lengths vary, a small column with low cardinality (variability of values) may reside completely within one block, while another column with high cardinality and longer length may take a thousand blocks.
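The difference between the two layouts can be sketched in a few lines. Using hypothetical data, a row-wise store keeps each record's columns together, while a columnar store keeps each column's values together, in row order, so the same index in every column reassembles one row.

```python
# Row-wise: each record's columns are stored consecutively.
rows = [(1, "NY", 100),
        (2, "NY", 250),
        (3, "CA", 80)]

# Columnar: all column one values together, then column two, then column
# three, kept in row order so matching indexes belong to the same row.
ids, states, amounts = zip(*rows)

# Reassembling the second row from the three column stores:
print(ids[1], states[1], amounts[1])  # 2 NY 250
```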
The columnar block is much larger than a row-wise block and contains metadata as well, but much of that metadata is dedicated to the compression dictionary, which the compression routines use and which is ultimately very important for placing and locating column values within the block.
In columnar, all the same data is there; it's just organized differently. The main benefit is the ability to highly compress the data. Column orientation greatly assists a compression strategy due to the high potential for similar values in columns of adjacent rows in the table. The 50 percent average compression in columnar implementations is high by comparison to other DBMSs. Compression packs more data into a block and therefore more data into a read. Since locating the right block to read and reading it are two of the most expensive computer operations, you want to get the most out of each read.
A byproduct of the compression is the ability to perform columnar operations like MIN, MAX, SUM, COUNT and AVG very rapidly. All the data needed for such operations is co-located. Likewise, the fewer the columns needed in a query (for both presentation and predicate analysis), the more advantage a columnar approach will have. Processing that requires dozens of fields will not perform as well.
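Both points can be sketched briefly with hypothetical data: adjacent similar values in a low-cardinality column compress well under a scheme like run-length encoding, and an aggregate touches only the one column it needs, never the whole row.

```python
from itertools import groupby

# Two columns of a columnar block, in row order (hypothetical data).
states = ["NY", "NY", "NY", "NY", "CA", "CA", "CA", "TX", "TX"]
amounts = [100, 250, 80, 40, 75, 60, 90, 30, 55]

# Adjacent equal values compress well: run-length encode the states column,
# storing each value once with its repeat count.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(states)]
print(rle)  # [('NY', 4), ('CA', 3), ('TX', 2)]

# MIN, MAX and SUM scan only the co-located amounts column.
print(min(amounts), max(amounts), sum(amounts))  # 30 250 780
```

Nine values collapse to three (value, count) pairs here; the win grows with the run lengths, which is why sorted, low-cardinality columns compress best.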
These are the major differences. The main reason why you would want to utilize a columnar approach is simply to speed up the performance of analytic queries. There is no difference in the SQL or data access tool used to interface with the data or the logical data modeling.
The columnar approach to data management is not new. In fact, the mainframe database Model 204, first deployed in 1972, uses a similar approach. However, the columnar time may have finally come.
Parallelism is key to query performance success, and massively parallel processing systems provide the most parallelism possible. MPP systems evolved from symmetric multiprocessing systems which, in turn, evolved from uniprocessor systems.
Uniprocessor systems hit their bottleneck in CPU processing as companies turned up the workloads to enterprise levels. Signal speed was limited by the speed of light, circuit sizes could shrink only so far and higher-performing processors cost exponentially more than the performance they delivered. The hardware vendors' response was parallel processing.
First-generation parallelism was SMP and included functionally specialized, data and workflow parallelism. The added processor(s) alleviated some of the CPU bottleneck, but created another one - in the bus that is used to connect the CPUs to the memory and the I/O unit which accesses the disk. Clusters allowed multiple collections of CPU, memory and bus, called nodes, to work together over an interconnect. With every advancement, companies were enjoying greater processing speed and throughput.
Disk may or may not be shared in the cluster architecture. Sharing nothing minimizes disk access bottlenecks and is preferred in MPP, along with an exponential uptake in the number of processors and the sophistication of the interconnect. MPP is generally the preferred architecture for analytic query, BI and data warehousing.
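The shared-nothing MPP pattern can be sketched with Python's standard multiprocessing module: rows are distributed across partitions, each worker (standing in for a node) scans only its own partition, and a cheap final step merges the partial results. The partitioning scheme here is hypothetical.

```python
from multiprocessing import Pool

def partial_sum(partition):
    # Each "node" scans only its own partition of the data; no node
    # touches another's rows or disk (shared nothing).
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Distribute rows across four partitions, as an MPP system would
    # distribute them across nodes at load time.
    partitions = [data[i::4] for i in range(4)]
    with Pool(4) as pool:
        partials = pool.map(partial_sum, partitions)  # parallel local scans
    print(sum(partials))  # inexpensive final merge of four partial sums
```

The expensive full scan runs in parallel across all partitions, while the coordinator's merge step handles only one small value per node; that asymmetry is what makes MPP scale for large analytic scans.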
As the software industry consolidates, vendors are promising greater integration and lower costs. However, such efficiencies have been slow to emerge. For all the talk of software consolidation, the fact remains that some of the most innovative, cross-functional and most suited technologies are being developed by vendors not listed in the Fortune 50. At the same time, the acquisition strategies of some megavendors have benefitted from advances by independent vendors.
A well-crafted data management strategy can meet corporate needs in a scalable fashion, utilizing the new options available, for much less. Alternatives to UNIX-based traditional online transactional processing databases for analytics include many nonstandard architectures, which are providing options for data warehousing and the storage of large data. Don't be shocked by, or averse to, an independent vendor becoming part of one of the large vendors' plans in the future, however.
Data appliances have emerged as viable short-list solutions for new or refurbished data management efforts. Vendors now dot the landscape (with visibility in analyst quadrants and spectrums, and representing hundreds of millions in venture backing), and they merit any data professional's understanding, attention and consideration.
The data appliance is a hardware/software/OS/DBMS/storage preconfiguration for data management requirements. Appliances utilize commodity components and open source DBMSs for a low overall total cost of ownership. These open source DBMSs provide a starting point for basic database functionality, and the appliance vendors focus on the necessary functionality enhancements. Performance of queries, especially against large volumes of data, is distinctly impressive due to the automatic parallelism appliances provide. Appliances also deliver low TCO for a mixed-workload data warehouse environment.
Data appliances provide scalability at a lowered cost. Open source is an attractive alternative. For applications, software as a service options are emerging and the near future promises viable options of cloud, correlation, distributed computing and XML server frameworks. And columnar-oriented DBMSs, mostly coming from independent vendors, are claiming the analytical category. Some of the reasons to explore the options include:
- Greater integration with multiple technologies,
- Greater innovation,
- More competitive pricing,
- More flexibility to swap tools when a particular tool no longer meets a need,
- More attentive post sales service and support and
- More attention paid to the needs of mid-market businesses.
Finally, today, most large enterprise software vendors offer some type of pre-built reports or dashboards centered on typical business functions such as finance, sales and marketing and supply chain management. In addition, a number of smaller companies have developed packaged analytics built on licensed platforms and targeted to specific industries, applications, or specialized functions or to supplement other third party services.
While a true data warehouse in a box has remained somewhat elusive, analytic packages bridge the gap between off-the-shelf canned reports and a ground-up custom BI solution. Such packages simplify the development process and deliver functionality with more cost certainty. Determining if a package is right for a particular organization requires careful consideration and understanding of the various costs and benefits of the solution options. Knowing these factors can help decision makers make the best choice for their organization:
- Source applications,
- Timing of delivery,
- Budget considerations,
- Comprehensiveness of needs and
- Business ability to adapt.
These days, the best answer to your large data and analytic platform needs may not be as evident as it was in the past. Given the continued high relevance of data and information, investment and innovation continues in this area, and there are a surprisingly high number of options. Many solutions may work, but at high short- and long-term prices. Developing a strategy consistent with your goals as an organization and with a broad and detailed view of the landscape is paramount to success.