The Enterprise Data Warehouse at a Crossroads
After a slow but substantial wave of adoption and maturity over the last two decades, data warehousing is facing a backlash from stunted initiatives and outright calls for its demise in the face of unstructured data streams and big data demands. Joe Caserta heads up his own enterprise consultancy on EDWs, where he has found interest in EDWs strong but changing dramatically over the last nine months. In the hardcore warehousing community, Caserta is also fairly well known as co-author, with Ralph Kimball, of the static industry resource “The Data Warehouse ETL Toolkit,” now in its 10th year in publication and selling “steadily” in print and digital formats. Information-Management.com Senior Editor Justin Kern recently discussed with him the rapid change underway for the EDW, the instances where it remains a strong fit and how warehousing has to evolve to stay vital for business data.
Information-Management.com: Let’s start by anchoring some of the more conceptual and theoretical thoughts on the progression of data warehousing and the EDW with your “state of the enterprise data warehouse.” What are business clients asking for on the warehousing front and how has that really changed over the last year or so?
Caserta: There are still so many different maturity levels when we talk about data warehousing. There are clients who have had them in place for decades. Then there are others that are newer. I just got off the phone this morning with a business making its first foray into getting data out of its system and organizing it. It runs the gamut across industries, across environments, and there’s no ‘secret sauce’ with the maturity level. Some were early adopters and some resisted. The thing with taking the plunge into big data is that there are those who took the traditional data warehousing route who are starting to hit a wall. Then there are those businesses who never built a traditional data warehouse who wonder ... if they should dive right into a big data solution.
What are the options you’re telling people when the big data discussion comes up? And what is working and what’s not?
We were recently invited as consultants to a company who didn’t have any data warehouse at all. They asked, ‘What if we just put the entire thing in MongoDB? Would that work?’ It’s a challenge to find a general answer for clients. I like to use the concept of ‘crawl-walk-run.’ It depends on how you define your benchmarks, but, in essence, just getting your data across the enterprise, consolidated in a form where you can use it, that’s crawling. Before you can do the other two, you need to have some governance and have your data established in a way where you can get at it. Then, it depends on your sources. If you’re a dot-com business and you’re streaming unstructured data sources, maybe there’s really no reason to go the traditional route. ... And one of the things not always considered up front is the size and ability of your IT. If you’ve got one or two guys running the show, you probably don’t want a big data solution because it’s so bleeding edge and there is a lot of work to be done. As far as development time alone, it takes a while to write all that code. Hub-and-spoke interfaces exist with the traditional data warehouse components.
Some of the strengths of the data warehouse, then, are the talent levels and the maturity of best practices and approaches. But the data warehouse, as you’ve certainly heard and seen, has gained a reputation of late for consistent problems or outright failures. Is that pushing people more toward at least trying the ‘bleeding edge’ Cassandra or Hadoop clusters rather than the stickier EDW?
Why data warehousing has or doesn’t have a bad reputation could probably be another conversation entirely. But I think finding talent and reputable software ... is a main part of what businesses do consider. We just had a client who went with a known MPP database over an open source database purely because the MPP database had a financial history, it was stable and it’s been around for years. If they take the time to invest in that database, it’s going to be around for a while. The big data software is essentially [from] startups. It’s anybody’s guess if they’re going to survive or not. ... If it’s my business and I’m building an analytics platform, would I go with something tried-and-true or something on the cutting edge? That’s the business’s decision on risk. I think from the clients’ point of view, it depends on how much risk you can afford, and if you want to be viewed as visionaries in this space, with the thought that investors will like you more for that thinking. If I’m a developer and I do this type of thing for a living, do I stick to my guns and dig my heels in about being a traditional data warehouse guy? You’re in a business where your livelihood is providing solutions. Absolutely, 100 percent you have to start taking this big data paradigm seriously. It’s great to know SQL, it’s great to know databases. But you should put your money where your mouth is in the technology space and get yourself and others caught up.
Especially when you have the business side driving some of the interest in more data-driven decisions and analytic processes. In the formative years of data warehousing, you didn’t have the business-side echo chamber on adoption like you do with big data.
What’s been very obvious over the last nine months is that we have business and IT repeatedly coming to us saying, ‘We want a big data solution. Can you help us find a use case for it?’ We’re helping them do that, and we’re doing it through POVs (proofs of value) to find out if it’s really more valuable to go with a big data solution versus the traditional data warehouse solution.
That puts the consultants and vendors in a spot where, to be honest, they’re being paid by the companies to do this and you’re not going to turn down business. What’s your approach to bringing some caution to big data expectations, whether the answer is a big data solution or the EDW?
The big thing is resources. We can build a big data warehouse solution for them. For traditional data warehousing solutions, DBAs, SQL developers, ETL developers, they’re more and more common. Finding people who have a deep understanding of Hadoop and who know the landscape of integrating NoSQL databases with a big data paradigm, who know Hive or Pig languages, who are fluent in R or machine learning tools ... this is a whole, completely different set of skills that most organizations don’t have. And those resources are scarce.
On that other side, how are the ETL tools or the capability to use those tools advancing to keep the data warehouse relevant?
Now, what we’re doing is introducing new sources of data, so getting the data is different. We’re not working in only SQL anymore. And then where we store that data in clusters is different because we really don’t care about the structure. In the traditional EDW, the structure is so important that if you don’t have a proper data model, it’s never going to perform. With Hadoop, you throw the data in as it comes and you give it structure as you’re reading it. It’s a completely different way to think about data, and it takes people who’ve been doing it the traditional way for years some time to get their heads around. There is still that data movement, out of ERP systems or transaction systems, and that all requires ETL. Getting that data into Hadoop still requires ETL. And once it’s in Hadoop, you’re still continuously manipulating the data, only it’s not in SQL anymore, you’re using Pig or Python or straight MapReduce. All of that I still consider ETL. You’re extracting the data, loading it in Hadoop and then putting it somewhere else. It’s more of a conceptual change, but at the root you’re transforming data for usage, which won’t go away.
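As a rough illustration of the "give it structure as you're reading it" idea described above, here is a minimal Python sketch. It is not taken from the interview: the sample records, the field names (`user`, `region`, `amount`) and the helper functions are all hypothetical, and a plain list stands in for files sitting in a cluster.

```python
import json

# Hypothetical raw events, kept exactly as they arrived.
# Nothing enforced a schema when they were written.
RAW_EVENTS = [
    '{"user": "a1", "region": "east", "amount": "19.99"}',
    '{"user": "b2", "amount": 5}',
    'not json at all',
    '{"user": "c3", "region": "west", "amount": "12.50", "coupon": "X"}',
]


def read_with_schema(lines):
    """Apply structure at read time: parse, coerce types, fill defaults."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable record: skipped here, still kept in raw storage
        if not isinstance(record, dict):
            continue
        yield {
            "user": str(record.get("user", "")),
            "region": record.get("region", "unknown"),
            "amount": float(record.get("amount", 0.0)),
        }


def sales_by_region(lines):
    """A small transform step over the structured view of the raw data."""
    totals = {}
    for row in read_with_schema(lines):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(sales_by_region(RAW_EVENTS))
```

The point of the sketch is only the ordering: the raw lines are stored untouched, and the shape of the data is decided by whoever reads it, which is the reverse of modeling the tables first in a traditional EDW.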
So what does change? In the tug of war over the role of the data warehouse against the overall changing data management landscape, there will be some significant evolution over the next few years. What do you expect and hope that happens with the EDW and its users in the near future?
Breaking the barriers of relational database management systems for analytics, that’s something I see developing in full within five years, if not three. All of the overhead of a relational database is about getting data into it. As we all know from traditional data warehousing, it’s all about getting data out of it. We’re going to move out of those relational databases. Secondly, what should be happening more is a trend around users, especially the people who don’t know much about data and don’t exactly need to know much about the data but need answers. Google knows, depending on the question, what type of infographic to render the data in and it automatically does it. I think if you can do that across the world’s data, my god, you should be able to do that within an enterprise. Any user who doesn’t know anything about the data side should be able to go into their enterprise interface and type ‘regional sales’ and then have a chart pull up. We should not have to ask people to construct queries for everything. Finally, I expect the maturity of BI tools that are able to fit into Hadoop directly, probably via in-memory databases. As in-memory starts distributing across the cluster and becomes scalable, the latency part kind of goes away. Then, like the old days when they started putting cubes on desktops with tools like BusinessObjects and Cognos, it’ll be that same type of thing. You can create that query out of Hadoop, throw it into an in-memory database and just siphon the data directly.