Buzzword: “Column-store database”

There has been copious amounts written recently about the advantages (and disadvantages) of column-store databases, so I thought I’d do a little research to find out what the noise was about.  After all, SybaseIQ has been around for a decade now, touting the benefits of column storage and compression.  Vertica seems to be making the most noise out there now, with Michael Stonebreaker leading the charge.  But there are a number of column-based vendors out there these days (see my data warehouse appliance spreadsheet), among them Kickfire, Calpont, InfoBright, and ParAccel, so this is obviously not just a lone-wolf situation.

Breaking this down, the concept is really simple - you are storing your data in what you would typically think of as an “index” in the row based world.  Storing your data in this manner gives you two big advantages:

  1. You can achieve much higher compression rates, since the likelihood of encountering repeating values within one column is much higher than within a row.
  2. For typical analytical queries access a small number of columns, you can skip all the other columns entirely which provides a huge performance boost.

As with any technological approach, there are downsides:

  1. In the case of operational reporting, you can actually see performance degradations since you’re typically reading across rows as opposed to down columns.
  2. Writing data in a row format is quicker, which is important in a low-latency reporting environment where trickle-feeds or other near-real time updating is required.

This discussion is similar to the “dimensional model vs 3rd normal form” debate.  I don’t think there is a right or wrong answer.  You need to understand how your users are accessing the data and the loading requirements before making a decision.

Leave a Reply

You must be logged in to post a comment.