Last December, my column discussed columnar databases. When someone quoted it recently, I realized it had already become outdated, because a new generation of vendors, including Vertica Systems, ParAccel, Calpont and Infobright, has joined the older columnar systems from Sybase (IQ), Alterian, smartFOCUS, Kx Systems and 1010data.

In general, the newer systems assign dedicated disk drives to each processor (a shared-nothing architecture), while the older systems apply multiple processors to a shared storage pool. Each approach has its strengths and weaknesses, which introduces new differences to consider. In addition, the broader adoption of multiple processors and 64-bit memory addressing removes some of the performance constraints that hampered earlier systems. First I'll revisit the original list of items to see which are still relevant, and then add a few new ones.
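
To make the architectural split concrete, here is a minimal Python sketch of my own (not any vendor's implementation) of the shared-nothing approach: each row is hashed to exactly one node, which owns both the storage and the processor that scans it. A shared-storage design would instead let any processor read any row, trading partitioning work for contention on the common pool.

    # Illustrative sketch only: hash-partition rows across "nodes" so that
    # each processor scans only its own private slice (shared-nothing).
    NUM_NODES = 4
    nodes = [[] for _ in range(NUM_NODES)]  # each node's dedicated storage

    def load_row(row_key, row):
        # Every row lands on exactly one node; there is no shared pool.
        nodes[hash(row_key) % NUM_NODES].append(row)

    def parallel_count(predicate):
        # Each node answers from its own disk; partial counts are merged.
        return sum(sum(1 for r in shard if predicate(r)) for shard in nodes)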

Load time, incremental loads and data compression. These reflect the need for a columnar database to restructure data originally stored in another format. They can be critical bottlenecks at large data volumes, and the older systems varied widely in their performance. As a result, these were the most important considerations when comparing older columnar systems.

Today, multiple processors, larger memory space and more scalable disk storage have greatly improved load and compression rates in nearly all columnar systems. Substantial differences still exist, but performance of even the slower systems is likely to be adequate. As a frame of reference, leading columnar databases several years ago loaded around 10GB per hour, while today’s best products load 150 to 200GB per hour. Many can reach whatever load rates are needed by simply adding more processors. Additional processing power also allows greater compression of stored data, because systems can decompress it more quickly after it is read from disk. (Decompression is not always needed; many operations run on the compressed data directly.) Bottom line: you still need to consider load and compression performance, particularly if you’re dealing with terabytes of data or need quick incremental updates. But these issues no longer head the list.
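
To see why decompression is sometimes unnecessary, consider run-length encoding, one common columnar compression scheme: a repetitive column collapses into (value, run length) pairs, and an aggregate such as a sum can be computed on those pairs directly. This Python sketch is my own illustration, not any product's code.

    # A repetitive column compresses to (value, run_length) pairs, and some
    # aggregates never need the raw rows back.
    def rle_encode(column):
        runs = []
        for v in column:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1            # extend the current run
            else:
                runs.append([v, 1])         # start a new run
        return runs

    def rle_sum(runs):
        # The sum is computed on the compressed form; no decompression step.
        return sum(value * length for value, length in runs)

    column = [5, 5, 5, 2, 2, 9]
    assert rle_sum(rle_encode(column)) == sum(column)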

Structural limitations. Some early columnar databases imposed significant constraints on data structure, such as requiring that all tables use the same primary key. These crude limitations are largely gone. However, some of the newer systems do have more subtle limits, such as performing better on star schemas than normalized architectures. If you expect to use a star schema anyway, you probably won’t have a problem with any modern columnar system. But if you use other structures, carefully check how well a given product will perform on them.
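
For readers who haven't met the distinction, a star schema keys one central fact table directly to small dimension tables, so every attribute lookup is a single hop. The Python sketch below (with invented table names) shows the shape; a normalized design would chain lookups through intermediate tables, which is exactly where some columnar engines slow down.

    # Star schema in miniature: a fact table references small dimension
    # tables by key, so a query resolves each attribute in one hop.
    dim_region = {10: "east", 20: "west"}       # dimension table
    fact_sales = [                              # fact rows: (region key, amount)
        (10, 500.0),
        (20, 125.0),
        (20, 300.0),
    ]

    # "Revenue by region" reads the fact table once, one dimension hop per row.
    revenue = {}
    for region_key, amount in fact_sales:
        region = dim_region[region_key]         # the single join hop
        revenue[region] = revenue.get(region, 0.0) + amount
    print(revenue)                              # {'east': 500.0, 'west': 425.0}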

Access techniques. While many early columnar systems were not SQL-compatible, all of today's products offer some level of SQL access. (Some still offer their own languages for functions that SQL handles poorly, such as time series analysis.) Still, there are many levels of SQL compatibility. You'll want to dig into the details for each system, particularly if you want to reuse existing SQL queries or SQL-based reporting tools.

Performance. This is one issue that hasn't changed. Columnar databases are all fast, but performance on particular tasks can vary substantially from system to system. Performance may also depend on system configuration, which makes it especially difficult to test. But performance is probably why you're considering a columnar system in the first place, so you'll want to be sure you know what you're getting.

Scalability. Any columnar database you're likely to consider will handle a couple of terabytes of data. But not all are proven at the 50 or 100TB level. In addition, some systems are significantly better than others at handling mixed query types and large numbers of simultaneous users. If you have needs like these, make sure your chosen vendor has similar installations in production or can demonstrate the necessary performance in a realistic test.

New Considerations

None of the old issues have vanished, but the frame of reference has shifted for many. In addition, some new considerations exist:

Fault tolerance. Many of the newer systems store data redundantly, either within or across nodes. This is done largely for performance purposes, but it can have the side benefit of easy, possibly even interruption-free, recovery from hardware failures. Many columnar databases are used for analytical work, where some downtime is acceptable. But if uptime matters to you, be aware that products differ substantially in this area.
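
As a simple illustration of the idea (again my own sketch, not any product's actual scheme), redundancy means each partition is written to more than one node, so a read can fail over to a surviving copy without interruption:

    # Sketch: every partition gets a primary copy and one replica on the
    # next node, so a single node failure never makes data unreachable.
    NUM_NODES = 4
    nodes = {n: {} for n in range(NUM_NODES)}
    alive = set(nodes)

    def write(partition_id, data):
        for n in (partition_id % NUM_NODES, (partition_id + 1) % NUM_NODES):
            nodes[n][partition_id] = data       # two copies, two nodes

    def read(partition_id):
        # Try the primary first; fall back to the replica if it has failed.
        for n in (partition_id % NUM_NODES, (partition_id + 1) % NUM_NODES):
            if n in alive:
                return nodes[n][partition_id]
        raise RuntimeError("both copies lost")

    write(7, "q1_sales_partition")
    alive.discard(7 % NUM_NODES)                # simulate a hardware failure
    print(read(7))                              # still answered, from the replica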

Data types. Columnar databases have traditionally analyzed conventional structured data. A few also support XML, text analysis and even binary objects. As with fault tolerance, you may not need this, but you should know that it’s available.

Database administration. One of the traditional benefits of columnar databases was their simplicity: users basically dumped in the data and the system organized it for them. It's still possible to work this way, but many systems now provide options such as multiple index types or sort sequences. These let users tune the system to their requirements, but they also mean database administrators must make the right decisions. It's still true that any columnar system should be easier to manage for a given analytical application than a conventional relational database, but you'll want to assess the differences in administrative workload among the columnar products themselves.
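
One example of the decisions an administrator now faces is choosing a sort sequence. A column kept sorted on the attribute queries filter on most often can be searched directly rather than scanned end to end; a hedged Python sketch of the tradeoff (not any product's mechanism):

    import bisect

    # The same column stored unsorted versus sorted on the most-queried
    # attribute; the DBA's sort-sequence choice decides which path exists.
    unsorted_col = [42, 7, 93, 7, 15, 7, 60]
    sorted_col = sorted(unsorted_col)           # [7, 7, 7, 15, 42, 60, 93]

    def count_full_scan(col, value):
        return sum(1 for v in col if v == value)    # touches every entry

    def count_sorted(col, value):
        # Binary search jumps straight to the run of matching values.
        lo = bisect.bisect_left(col, value)
        return bisect.bisect_right(col, value) - lo

    assert count_full_scan(unsorted_col, 7) == count_sorted(sorted_col, 7) == 3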
