Big Data Projects: How to Choose NoSQL Databases
So, you've succumbed to the buzz and now you're looking around, trying to make heads or tails of the massive amount of information out there hyped up as "big data." Or perhaps you’re even ready to start your own internal project to get your existing applications on the bandwagon. In either case, terrific! Your decision is a good one.
Unfortunately, now comes the flurry of potentially overwhelming questions:
- Where do I start?
- What are my expectations?
- What does big data mean to my company?
- What does big data mean in the context of our applications?
- How do I assess my application needs?
- How do I know or determine if big data solutions will work for us?
After some online research, you'll quickly find that most folks are merely picking a place at the edge of the pool, dipping their toes in here and there to test the water temperature.
The reality is that it is incredibly difficult to define the term big data. Its meaning includes so much more than just storing and using large data sets. When you hear people referring to big data, what they're actually referring to is the use of NoSQL database implementations to store and process large amounts of information.
Don’t be discouraged! In the following sections we'll get into what those NoSQL databases are and how to identify which is best, if any, for your project[s]. That's right; the goal here is to provide you with information that will allow you to draw the correct conclusions for your organization and your specific application[s] or project[s]. (In part two of this article, I explain how to migrate from relational databases to NoSQL.)
NoSQL databases aren't really databases. In fact, they are nothing like a traditional relational database management system (RDBMS). Instead they are implementations of various data stores which do not have fixed schemas, referential integrity, defined joins, or a common storage model. Also, they typically do not adhere to ACID principles (atomicity, consistency, isolation, and durability) and have sometimes widely varied technologies behind them. The term NoSQL (or Not only SQL) is intended to imply that many of these implementations also support SQL-like query capabilities.
In this big data market where the NoSQL database is king, there are more than 100 different offerings available in various licensed models. The fact that these non-databases vary is no accident. Each distinct implementation has different strengths, weaknesses, and generally accepted uses. However, the bulk of these break down into four major categories based on some common underlying characteristics -- as shown in this chart:
Choosing the Right Path
Place a heavy emphasis on defining your requirements. What are those? Well, that's a large discussion all by itself. However, I'll quickly paraphrase for the purposes of this discussion: Data requirements are artifacts captured while defining how an application behaves when it gathers, stores, retrieves, or displays information (data).
For example, is your application processing stock quotes, working with CRM data, or processing social information? Different application types have different needs, which is why there are so many NoSQL implementations, and not all of them are designed for your needs.
It may be that after careful evaluation, you determine that your current RDBMS approach is valid and appropriate for your application. That's not a bad thing. Traditional RDBMS certainly has its place and will remain very relevant for business use well into the future.
You see, there's loads of confusion about this big data thing because there is no One Path concept. It just doesn't exist. What's good for one business may not be good for another even though they are doing similar things. There are many factors that go into selecting the right implementation and, honestly, not everyone is careful or critical when evaluating their needs.
So you still think that you might be better off migrating away from RDBMS to a NoSQL implementation. This decision isn't for the faint of heart. It requires real consideration and planning, which raises several additional questions:
Question 1: How do I know which implementation is correct for my needs?
Fortunately, there are certain high-level criteria that help us get the decision process started. One of them comes from answering this question: Is the application read- or write-intensive today (e.g. large numbers of transactions in an OLTP system or large numbers of reads in an OLAP system)? If the answer is no, and we're merely dealing with volumes of information, we can automatically exclude items that fall into the "Column Families/Wide Column Stores" category. There are always caveats, but that's a good general rule of thumb: an initial litmus test, if you will.
Let's test this with an example: We've got a compiled application that processes transactions for a global book seller. This application sees no fewer than 70,000 transactions a minute, 24 hours per day, 7 days per week. Which category of NoSQL database implementations fits? It's obvious, right? You bet! It's the "Column Families/Wide Column Stores" category.
Let's try another. In this example we've got a web-based application that lets title companies offer secure signing of large numbers of documents. The transactional volume isn't large, nor is the number of reads. Which category fits now? Right! It's the "Document Store" category. This is primarily because only a small amount of data (the signatures) is changing, while whole documents need to be stored.
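The litmus test from the two examples above can be written down as a minimal Python sketch. Treat it as an illustration only: the `Workload` fields, the `first_pass_category` function, and the `INTENSIVE_PER_MINUTE` cut-off are all assumptions made for this example, and the threshold in particular is arbitrary. Your own requirements should decide what counts as "intensive."

```python
from dataclasses import dataclass
from typing import Optional

COLUMN_FAMILY = "Column Families/Wide Column Stores"
DOCUMENT_STORE = "Document Store"

# Assumption: an arbitrary cut-off for "intensive"; the article gives no number.
INTENSIVE_PER_MINUTE = 10_000


@dataclass
class Workload:
    """Hypothetical summary of an application's data requirements."""
    writes_per_minute: int
    reads_per_minute: int
    stores_whole_documents: bool


def first_pass_category(w: Workload) -> Optional[str]:
    """Apply the rule of thumb; None means the test is inconclusive."""
    # Read- or write-intensive applications point to wide column stores.
    if max(w.writes_per_minute, w.reads_per_minute) >= INTENSIVE_PER_MINUTE:
        return COLUMN_FAMILY
    # Low traffic, but whole documents to keep: a document store fits.
    if w.stores_whole_documents:
        return DOCUMENT_STORE
    # Otherwise wide column stores are excluded and more questions are needed.
    return None


book_seller = Workload(writes_per_minute=70_000, reads_per_minute=5_000,
                       stores_whole_documents=False)
title_company = Workload(writes_per_minute=50, reads_per_minute=200,
                         stores_whole_documents=True)

print(first_pass_category(book_seller))    # Column Families/Wide Column Stores
print(first_pass_category(title_company))  # Document Store
```

Returning `None` rather than guessing is deliberate here: the rule of thumb only rules wide column stores in or out, so anything else still needs a closer look at the requirements.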
I think you've probably got a handle on how to determine initial fits by category now. So long as we ask the right questions, with the right view of our data requirements, we should always be able to identify the right category to start trying to assess which implementations might meet our big data needs.
We're done then, right? We can pick any random implementation from the category we've identified, install it, configure it, and deploy our application[s] so we can start touting our Big Data story! Hold on there, we're not really done yet.
Question 2: What will we be using to communicate?
The language[s] your application[s] use to communicate is an important consideration when choosing the right NoSQL database implementation. In the earlier examples we were able to identify the right categories based on our knowledge of the requirements and the application usage. Now we need to narrow down the field of choices and zero in on what is likely to be right for our needs. Unfortunately we actually don't have enough information to identify, even at a high level, which implementation matches our communication needs. Yet!
We'll need help to answer this and other questions, so we'll employ the use of another table. In this table we've laid out a few of the most popular NoSQL database implementations, their protocol[s], API[s], licenses and replication models:
Clear as mud? Right! Basically what this table shows is exactly what you'll find on the web: Every NoSQL database implementation has its own way of getting data into and out of its store. Fortunately though, this is precisely what we need to match our requirements with implementations in our selected category because, in the end, our application[s] will need to know how to interact with it.
Shall we try another example? This time, we'll use our high-transaction application for our global book seller, which we've already matched to NoSQL database implementations in the "Column Families / Wide Column Stores" category. Now we take a look at our application's communication requirements. The application today communicates via an ODBC driver (a compiled binary). We could assume that if we use one compiled driver, we can use any of them. That line of thinking is not unheard of and would lead us to select Cassandra from the table above because it resides in the right category and because it also uses a compiled driver: its Thrift driver. A stretch? Perhaps a small one, but it's really close enough for the purposes of our discussion.
Can it really be that easy? Well, yes and no. There are many other factors that we should consider when choosing a NoSQL database implementation. Some of these will leverage information from the table above. The additional criteria can be determined by answering the following questions:
- Is this a commercial application? If so, is this for internal or external use? You see, there may be specific licensing restrictions that must be considered, especially in commercial, for-profit, applications.
- Are there existing deployment restrictions such as supported operating systems or others like single vs. multi-server, replication model, or even needing to meet specific backup requirements for the company's disaster recovery plan?
Failing to ask these questions, or failing to heed the answers, may cause your project to fail because of a poorly selected match.
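To make the narrowing-down step concrete, here is a small Python sketch that filters a list of candidates by category, protocol, and license. Only the Cassandra entry's category and Thrift protocol come from this article; the other two entries, the `Implementation` fields, the `shortlist` function, and the license sets are hypothetical placeholders, so substitute the real values from your own research.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Implementation:
    """One row of a comparison table (field names are illustrative)."""
    name: str
    category: str
    protocols: FrozenSet[str]
    license: str


# Assumption: a tiny stand-in catalog. The two "Hypothetical..." rows are
# made up purely to show the filter working; they are not real products.
CATALOG = [
    Implementation("Cassandra", "Column Families/Wide Column Stores",
                   frozenset({"thrift"}), "Apache-2.0"),
    Implementation("HypotheticalColumnDB", "Column Families/Wide Column Stores",
                   frozenset({"http-rest"}), "Proprietary"),
    Implementation("HypotheticalDocDB", "Document Store",
                   frozenset({"http-rest"}), "AGPL-3.0"),
]


def shortlist(catalog: Iterable[Implementation], category: str,
              required_protocol: str, allowed_licenses: Set[str]) -> List[str]:
    """Keep only implementations that satisfy every stated requirement."""
    return [
        impl.name
        for impl in catalog
        if impl.category == category
        and required_protocol in impl.protocols
        and impl.license in allowed_licenses
    ]


# The book seller: wide column store, compiled Thrift driver, and a license
# the (hypothetical) legal team has approved for commercial use.
print(shortlist(CATALOG, "Column Families/Wide Column Stores",
                "thrift", {"Apache-2.0"}))  # ['Cassandra']
```

Deployment restrictions such as supported operating systems, replication model, or backup requirements could be added as further fields and conditions in the same way.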
I hope this has helped in some small way to eliminate some of the confusion around big data and what it might mean for your company.
Note: This is part one of a two-part contribution. Part II features tips for a successful relational to NoSQL database migration. The author's views on NoSQL selection do not necessarily reflect the views of Information Management.