Requirements sessions with IT stakeholders for the ETL and data modeling behind analytics (dashboards, reports and the like) often start with the following conversation:

IT: “What data do you require for your analysis?”
Business: “What types of information do you have?”
IT: “We have lots of data, but we need you to be more specific…”
Business: “Well, if you could be more specific on what information you have… I could be more specific with my request for information…”

I call this being wrapped around the axle of “What do you need? What do you got?” requirements gathering. It can be “fun” to watch this process; it can also be VERY painful to experience. And this is the process for simple things like customer, product, profit and revenue: concepts where we already know the questions we want answered.

To get around this problem, savvy business analysts usually do some searching via SQL and “spelunking” around the data, and veteran IT teams do some business-level reconnaissance. Now, imagine the conversation above when we don’t know the questions or don’t even know that there are questions to be answered:

IT: “What data do you require?”
Business: “I don’t know… can I have a bunch?”
IT: “What’s a 'bunch'?”
Business: “Not sure, but I will let you know when I find it….”

That thud you heard was the IT team’s collective head hitting the desk while trying to figure out what a “bunch” of data is.

Not Getting Wrapped Around the Axle

Software vendors and business stakeholders have “solved” this problem by inventing self-service business intelligence, discovery analytics, and exploratory workloads. Each of these describes a new kind of analysis, a brand of analytics where users are not limited by IT constraints associated with data acquisition and data preparation. Discovery (as I will call it from here) is literally the concept of attempting to answer the questions we don’t know that we don’t know without having to accurately describe a “lot” or a “bunch.”  You just connect up data sources to a discovery platform and go exploring for answers and, better yet, questions.

Why NoSQL is Important

If you accept my premise that discovery is the unearthing of the answers to the questions that we don’t know we need to answer, then this section will seem a little odd, but here we go.

I like to give the example of the walled garden. A walled garden can be a very small and confined place such as a building courtyard. A walled garden can also be a very large and expansive place such as a botanical garden or arboretum. Both can offer a level of discovery and exploration. However, when you bump up against the wall, you quickly discover the questions that you can’t answer, such as “How does this customer data link to social information?” or “What path was taken to get to this webpage that caused a customer to not purchase a product?” In the world of discovery, the wall is the “box” that the data is placed within while you are exploring. SQL platforms by their very nature have a “box” around the information they contain. This is the S(tructured) in SQL.

In discovery, you don’t want to hit walls; you want to explore to your heart’s content, or at least until there is no more data to search, explore and discover. With NoSQL platforms, the walls tend to disappear. NoSQL platforms don’t require the level of structure that traditional SQL platforms require to store information. Many require no “real” structure (side note: all data has some structure) for data to be stored. Others allow variable structures such as JSON or key-value pairs. Being able to store data without the “box” avoids two problems: raw data being left out when it is loaded into a SQL platform, and having to ask IT for new data.
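To make the “box” concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the event records and field names (twitter_handle, click_path) are made up, an in-memory SQLite table stands in for a SQL platform, and a plain list of JSON strings stands in for a document store. It is not how any particular NoSQL product works internally.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are illustrative only.
raw_events = [
    {"customer_id": 1, "product": "widget", "revenue": 19.99},
    {"customer_id": 2, "product": "gadget", "revenue": 5.00,
     "twitter_handle": "@example",
     "click_path": ["/home", "/gadget", "/cart"]},
]

# The "box": a fixed relational schema decided before exploration starts.
columns = ("customer_id", "product", "revenue")
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer_id INTEGER, product TEXT, revenue REAL)"
)
for event in raw_events:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?)",
        tuple(event[c] for c in columns),
    )

# Fields the schema did not anticipate never make it into the table.
left_out = sorted({key for e in raw_events for key in e} - set(columns))

# Stand-in for a document store: each record keeps the structure it arrived with.
documents = [json.dumps(e) for e in raw_events]

# The click-path question can still be asked of the stored documents.
explorable = [
    doc for doc in map(json.loads, documents) if "click_path" in doc
]

print("left out of the SQL table:", left_out)
print("records with a click path:", len(explorable))
```

In this sketch the social and click-path fields are dropped on the way into the table, so the questions from the walled garden example cannot be asked of it without going back to IT, while the stored documents still carry them.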

My Advice to You: Start Discovering Heavily

When you use the analytical process known as discovery, I recommend that you look for tools and environments that allow you to connect to NoSQL platforms such as MongoDB, Hadoop’s HDFS, Cassandra and CouchDB. Without access to these multi-structured data sets, your exploratory process will be limited, and so will the number of questions you can discover that you didn’t know you had.

What say the readers?

  • Have you gotten good/great discovery results from environments that feature only SQL-access information?
  • Should we eliminate the use of structured data in exploration because it is too limiting?
  • Did you catch the movie reference in section 4?

Provide your comments below and/or ping me via Twitter at @JohnLMyers44 with the hashtag #noodlingNoSQL.