One of the more interesting concepts recently in the world of NoSQL is the push to bring SQL access to flexible multi-structured data sets in NoSQL data management systems or, more simply, put SQL on Hadoop, to give a specific example. Some consider this to be the next great evolution in the future of NoSQL platforms. Some consider it to be a fool's errand to attempt.
Here are some concepts to consider when you think about SQL on Hadoop: the degree/development of SQL being applied to NoSQL, and why NoSQL was named No-SQL in the first place.
Degree/Development of SQL on NoSQL
When you start talking about SQL, some people think it’s black and white. If you are NoSQL, you don't support SQL. If you do support SQL, then you SUPPORT SQL. This means in the minds of many business stakeholders (and even some CIOs) that you can simply connect your existing business intelligence platform up to Hadoop or MongoDB or Cassandra with no hassles, issues or concerns. If you read some of the marketing materials, it is as if "It slices. It dices. It juliennes..."
Yet, when you read the technical information about the SQL available for some SQL on NoSQL platforms, you will find that it is sql and not SQL. The major standards-based versions of SQL are SQL-92, SQL-99 and SQL-2003. Every vendor has its own take on SQL, but for the most part the core syntax and extensions of the "SQL"s is standard (i.e., what I mean when I say "you SUPPORT SQL"). However, the SQL on NoSQL is barely to the point where SQL-92 "compliance" can be achieved. Yes, you can write "select * from tablename". No, you cannot connect your Cognos or BusinessObjects instance to your NoSQL platform without an issue... "It slices... It might dice... It won't julienne."
SQL on NoSQL?
While I am not against the concept of putting a SQL access layer on NoSQL data management systems, there is a reason why NoSQL was "named" NoSQL. Most NoSQL data management systems support the storage and management of multi-structured data sets. Not only are these data sets multi-structured, but they tend to be flexible in their "multi-structured-ness." For example, JSON is a popular data format. However, it is not a standard structure. You can add tags into a JSON document easily and at the will of the individual creating the structure. This is the beauty of NoSQL data management: You aren't locked into the schema of table for storage. If you want another field/tag, just add it to your JSON document structure, much like when you want to add emphasis to a blog posting you simply add a tag to your text - you don't need to consult Strunk and White for style, you just do it.
SQL requires an awful lot of metadata on the structure that you want to access in order to write a query. Until just recently, NoSQL data management has been focused on storing files (and flexible ones, for that matter) rather than storing metadata about the structure of those files. Managing the "table"/file metadata on an ongoing basis can be difficult particularly when application developers who aren't using SQL database for some reason decide to add a tag or remove a tag based on the moving targets of their personal choices errr ... of the changing requirements of the business. Metadata tracking and management is a task that most SQL data management systems have been tackling and "perfecting" (I use that term loosely) since, well, 1992 with the SQL-92 standard, as stated above. Simply put, just because you want a SQL access layer on a NoSQL, doesn't mean that you can do it in 6 to 12 months.
Short-term vs. Long-term
Again, I am not against the concept of providing SQL access on NoSQL data management platforms such as Hadoop, MongoDB or Cassandra, to name a few. Over the past couple of months, I have seen some excellent developments for each of those platforms... and many others. Yet, in the short-term, SQL on NoSQL will be something that should be handled carefully, and expectations should be managed with both technical leadership and business stakeholders. Access layers that are barely SQL-92 compliant shouldn't be expected to meet the expectations of the Financial team's MDX queries and visualizations.
In the long-term, I think the best chance for SQL on NoSQL comes from the ability to store structured data in multi-structured environments much more easily than multi-structured data can be stored in structured environments. I also think that developments in the open source community and their associated commercial versions will push the envelop to allow more mature SQL access methods on NoSQL platforms just as they have pushed storage of multi-structured data over the past 3 to 5 years.
What say the readers?
- Do you like your HIVE access to Hadoop?
- Do you agree that 'sql lite' isn't SQL-2003 and shouldn't be compared that way?
- Have you run CQL queries on Cassandra just as your Financial analysts want you to?
- Do you want your SQL to slice, dice and julienne?
Provide your comments below and/or ping me via twitter at @JohnLMyers44 with the hashtag #noodlingNoSQL.