Overcoming the Initial Challenges of Big Data
Joe Caserta is the founder of big data consulting company Caserta Concepts, which has worked with Liberty Mutual, Swiss Re and other large corporations. He believes that many organizations underestimate the complexities of moving from relational databases to Hadoop and other big data technologies. He says understanding exactly what technology can and cannot do is essential to the success of big data and deep analytic initiatives. In an interview with Information Management, Caserta discussed the initial challenges of big data and how to overcome them. He also shared some interesting thoughts about Hadoop and other popular big data initiatives.
For all the hype surrounding big data, how much confusion still exists in the marketplace?
The area of confusion is the actual functionality of Hadoop. It’s been sold to the masses [as a way to] replace relational databases and it very well may, but it doesn’t really behave like a relational database. So if someone is used to writing SQL queries in Oracle or SQL Server and then they migrate to Hadoop, it’s difficult to do that. It’s batch-based, and it’s very much like old-school mainframe JCL [Job Control Language] jobs.
It’s just very different. And I think that’s kind of the “gotcha” when people migrate from more traditional data warehousing to big data warehousing.
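To make the contrast concrete, here is a minimal sketch of how a one-line SQL aggregation, SELECT region, SUM(amount) FROM sales GROUP BY region, looks when written as explicit map, shuffle and reduce steps. It is plain Python and does not use Hadoop's actual API; the sales records and function names are hypothetical and serve only to illustrate the batch-style programming model Caserta describes.

```python
from collections import defaultdict

# Hypothetical sales records (region, amount), invented for illustration.
SALES = [
    ("east", 100),
    ("west", 250),
    ("east", 50),
    ("north", 75),
    ("west", 25),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def total_by_region(records):
    """Run the whole batch job from start to finish over the full input."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(total_by_region(SALES))
```

The job processes the entire input in one pass and only then produces a result, which is why someone used to interactive SQL can find the model unfamiliar.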
So what should organizations do? How do they start to overcome the challenges with big data?
Well, first of all, the term big data is kind of a misnomer. It really can also be thought of as just low-cost data. That’s really the main impetus for a lot of the implementations that we’re doing. Dollar for dollar, Hadoop is exponentially cheaper; some say tenfold cheaper. [However, it would help to] set expectations that you are paying much, much less, but you’re not getting full functionality. And you need to be very clear [about] the functionality you do get and the functionality that you won’t get.
The challenge is that every day the functionality changes because Hadoop is advancing all of the time.
Its biggest drawback for the last year or two has been that it didn’t support interactive queries. But with [new releases of] Hadoop 2.0 and [the] YARN [resource management layer] as well as the maturity of things like [the] Impala [massively parallel processing SQL query engine] and Stinger [query performance-improvement effort], you can actually do interactive queries on Hadoop.
Education is probably the biggest thing [needed] to improve [people’s] understanding because Hadoop is really a completely different animal. You deal with your data very differently; the languages that you use are very different.
Where do skilled Hadoop people come from? What kind of background do they have? Are they applications people or data people?
Application people and data people are two different personalities. A lot of the big data technologies are really born from application guys who are trying to deal with very, very, very large volumes of data. A data person will just try to optimize the databases that he can [in order to handle the volume].
But because many of the big data technologies are born through app guys, such as Java and Python programmers, [newer technologies like] Pig and Hive and those types of languages are easy for people who are application developers to pick up. If you’re a data guy used to dealing with Oracle and SQL Server, it’s kind of like a foreign language to learn Java, MapReduce and Python. You just don’t know these languages; it’s completely different.
So I think having training for the different technologies that people have to transition to is really important.
Finding value and insights within the data is the key with big data initiatives. Do you have use cases or examples of big data in action that stand out as interesting to you?
Roughly 50-60 percent of our projects right now are big data and the rest are traditional data warehousing. And the initiative behind every single one of them was not just a need to handle the volume; 100 percent of the projects were launched because of the need to be able to perform better analytics. And with tools like Python and MapReduce and Java and Pig and Hive you can do really deep analytics.
The thing that’s really nice about big data and probably the biggest benefit of the big data paradigm is that [much of it is] open source, so a lot of the algorithms for machine learning and for a lot of the data science routines are readily available. You don’t have to write them from scratch. So you can use a tool like Mahout [machine learning software] and build a recommendation engine without being a data scientist. We’ve done a few of those types of projects where it’s very much like Amazon’s “If you like this you might like that.”
You can take that logic and apply it to anything. If you like the characteristics of this employee, you could look at other candidates and it could recommend who might be the best recruit. You can do it for stocks: If you like the characteristics of this stock and the performance of this stock and also the characteristics of other people looking at that stock, it can make recommendations for other stocks to look at. That’s probably the biggest use case that’s common across our projects: it usually includes some sort of recommendation engine or some sort of machine learning.
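The "If you like this you might like that" logic can be sketched in a few lines. The example below is a simple item co-occurrence recommender in plain Python. It is not Mahout's API and not necessarily the method used in the projects described; the users, items and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: which items each user liked.
LIKES = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B"},
    "cat": {"B", "C"},
    "dan": {"A", "D"},
}


def cooccurrence(likes):
    """Count how often each pair of items is liked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in likes.values():
        for x, y in combinations(sorted(items), 2):
            counts[x][y] += 1
            counts[y][x] += 1
    return counts


def recommend(item, likes, top_n=2):
    """Return the items most often liked together with `item`."""
    counts = cooccurrence(likes)
    # Highest co-occurrence first; ties are broken alphabetically.
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("A", LIKES))
```

The same counting approach applies whether the items are products, job candidates or stocks; only the definition of a "like" changes.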
You lead a big data meetup in the New York City area. What are these about and what kinds of discussions take place at the meetups?
We have 1,154 members, so it is a really big and very, very enthusiastic group. We meet on a weeknight on everyone’s own time, and we pick a topic that is usually something we’re working on at the time. We share our experience so that we can make the community a little bit smarter, and the community is pretty damn smart already. We make it interactive; we’ll share what we’re learning. We also open it up to the floor and get people to comment on their experiences. Collectively, we all walk out of the group a little bit smarter than when we walked in.
It’s open to anyone, and it’s completely free.
What else do you see happening?
One observation is the notion of big data solutions replacing traditional data warehousing solutions. If we had this conversation a year ago, I’d say that’s probably going to happen. But that transition is happening much, much slower than I originally predicted and, quite frankly, I’m not quite sure; some things have to change in big data for that to actually happen.
What’s happening now is that [people aren’t choosing big data solutions over traditional solutions]; it’s really becoming a hybrid solution, and the whole notion of a data lake is being realized.
The data lake is really a central repository for your data. It’s lightly structured and lightly governed. And then from there the data scientists have access to it. They’re gleaning insights off of semi-raw data. But then from there they’re actually fully preparing it, fully governing it and fully structuring it, and the result is what exists today as the data warehouse.