How to Judge a Columnar Database

  • November 29 2007, 5:01pm EST

Columnar databases are an idea whose time has come - again. First introduced inthe 1970s in (still available) products such as Model 204 and ABABAS, thisapproach has resurfaced recently at Vertica and to some extent at QD Technology. These join resurgentcolumnar database vendors Alterian and SmartFocus, whose original products dateto the 1990s.

As the name implies, columnar databases are organized by column rather thanrow: that is, all instances of a single data element (say, Customer Name) arestored together so they can be accessed as a unit. This makes them particularlyefficient at analytical queries, such as list selections, which often read a fewdata elements but need to see all instances of these elements. In contrast, aconventional relational database stores data by rows, so all information for aparticular record (row) is immediately accessible. This makes sense fortransactional queries, which typically concern one record at a time.

Columnar databases were largely eclipsed by relational systems in the 1980sand 1990s, when huge improvements in hardware price/performance allowedrelational databases to compete effectively despite their analyticalinefficiency. Columnar technology may be re-emerging today because analyticdatabases are now so large that the hardware required to use relational systemsis too expensive even at current prices.

Today’s columnar systems combine the columnar structure with techniquesincluding indexing, compression and parallelization. But the fundamentalquestions asked in evaluating these systems are the still the same.

Load time: How long does it take to convert source data into thecolumnar format? This is the most basic question of all. Load times are oftenmeasured in gigabytes per hour, which can be unbearably slow where tens orhundreds of gigabytes of data are involved. The question often lacks a simpleanswer, because load speed can vary with the nature of the data and choices madeby the user. For example, some systems can store multiple versions of the samedata, sorted in different sequences or at different levels of aggregation. Userscan build fewer versions in return for a quicker load, but may later pay a pricein slower query times. Realistic tests based on your own data are the best pathto a clear answer.

Incremental loads: Once a set of data has been loaded, musteverything be reloaded every time there is an update? Many columnar systemsallow an incremental load, taking in only new or changed records and mergingthem with the previous data. But close attention to detail is critical, sinceincremental load functions vary widely. Some incremental loads take as long as afull rebuild; some result in slower performance; some can add records but notchange or delete them. Often incremental loads must be supplemented periodicallywith a full rebuild.

Data compression: Some columnar systems greatly compress the sourcedata so the resulting files take up a fraction of the original disk space. Theremay be trade-offs: uncompressing the data to read it can slow performance. Othersystems use less compression or store several versions of the compressed data,taking up more disk space but gaining other benefits in return. The mostsuitable approach will depend on your circumstances. Bear in mind that thedifference in hardware requirements can be substantial.

Structural limitations: Columnar databases use different techniquesto mimic a relational structure. Some require the same primary key on alltables, meaning the database hierarchy is limited to two levels. The limitsimposed by a particular system may not seem to matter, but remember that yourneeds may change tomorrow. Constraints that seem acceptable now could preventyou from expanding the system in the future.

Access techniques: Some columnar databases can only be accessedusing the vendor’s own query language and tools. These can be quite powerful,including capabilities that are difficult or impossible using standard SQL. Butsometimes particular functions are missing, such as queries that compare valueswithin or across records. If you do need to access the system with SQL-basedtools, determine exactly which SQL functions and dialects are supported. It’salmost always a subset of full SQL; in particular, updates are rarely available.Also be sure to find whether performance of SQL queries is comparable toperformance with the system’s own query tools. Sometimes the SQL queries run agreat deal slower.

Performance: Columnar systems will usually outperform relationalsystems in nearly all circumstances, but the margin can vary widely. Queriesinvolving calculations or access to individual records may be as slow or slowerthan a properly indexed relational system. Create a set of sample queries andtest them against a prototype system.

Scalability: The whole point of columnar databases is to get goodperformance on large databases. But you can’t assume every system will scale totens or hundreds of terabytes. For example, performance may depend on loadingselected indexes into memory, so your hardware must have enough memory to dothis. As always, first ask whether the vendor has existing systems running at ascale similar to yours and speak to those references to get the details. Ifyours would be larger than any existing installations, then be sure to testbefore you buy.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access