Hadoop for Operational Business Intelligence
Business users work under tight deadlines to provide answers to pressing business questions. Yet in today’s dynamic business environment, data and analytic requirements are ever-changing, and users never seem to have access to what they need when they need it. Even when data is available, it can be weeks or even months old and often lacks useful detail. The overstretched IT department simply can’t keep up with user demand, and the backlog of requests just gets worse as data from inside the enterprise and from new social media sites like Twitter grows. While business users wait, companies miss opportunities that could have improved their bottom line.
Operational business intelligence (BI) attempts to address these types of information delivery issues by allowing reporting and analysis to be performed within a very short window of time after business transactions have occurred. As the complexity of traditional data warehouse platforms makes it nearly impossible for IT departments to accommodate requests for operational BI in a timely manner, many organizations have turned to alternative approaches. Historically, in addition to replicating front-office data stores for BI purposes, organizations have achieved operational BI either by running overnight batch reporting jobs directly against their online transaction processing systems or by creating an operational data store (ODS) that contains some limited history from those systems.
However, even these commonly used techniques for operational BI are challenged by rapidly evolving business requirements, growing demand for granular data and the increase in unstructured data. Hadoop, an open source framework inspired by the Google File System and MapReduce, is best known for supporting data-intensive distributed applications. For a number of reasons, though, it will most likely become better known for its ability to meet the requirements of operational BI.
One of the many challenges associated with operational BI is that the IT department simply doesn’t have the resources and budget to keep up with constantly changing data requirements in today’s dynamic business world. Hadoop readily accommodates change because it stores raw data in any format and therefore does not require complex data and schema mappings that are time-consuming to set up and difficult to maintain. Performance is achieved by scaling out elastically rather than through the painstaking work of tuning the hard-coded schemas used in data warehouses or limiting the history stored in an ODS. This elastic scalability makes it possible to add and remove nodes as demands change. In addition, the Hadoop Distributed File System (HDFS) is an affordable alternative to more expensive database storage.
Increasingly, companies are realizing that true opportunities to improve the bottom line come from a detailed understanding of business operations, such as individual customer needs and how well they are being served. As a result, demand for more granular data over longer periods of time is growing. Granular data might represent a single event, such as a customer purchasing a product at a given time of day for a specific price, or even details about the customer’s clickstream that led to the purchase. Summary data, on the other hand, might aggregate this data to total product sales by all customers for the day, week, month, quarter or year.
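The difference between granular and summary data can be illustrated with a minimal sketch. The purchase records, field names and the summarize_daily_sales function below are hypothetical, invented only for illustration, and this is plain Python rather than Hadoop code. Prices are held in whole cents to keep the arithmetic exact.

```python
from collections import defaultdict

# Hypothetical granular data: one record per individual purchase event.
purchases = [
    {"customer": "c1", "product": "widget", "day": "2011-03-01", "time": "09:15", "price_cents": 1999},
    {"customer": "c2", "product": "widget", "day": "2011-03-01", "time": "14:40", "price_cents": 1999},
    {"customer": "c1", "product": "gadget", "day": "2011-03-01", "time": "16:05", "price_cents": 4500},
    {"customer": "c3", "product": "widget", "day": "2011-03-02", "time": "10:30", "price_cents": 1799},
]


def summarize_daily_sales(events):
    """Collapse purchase events into total sales per (day, product)."""
    totals = defaultdict(int)
    for event in events:
        totals[(event["day"], event["product"])] += event["price_cents"]
    return dict(totals)


summary = summarize_daily_sales(purchases)
for (day, product), total_cents in sorted(summary.items()):
    print(day, product, total_cents)
```

Once the data has been summarized this way, the customer, the time of day and the individual price paid can no longer be recovered, which is why operational questions about individual customers depend on keeping the granular records.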
Operational BI usually requires the storage of highly granular data, not just data summaries. Yet, due to scale restrictions and performance implications, organizations must either limit the time period for which they can store granular data in their ODS or settle for some degree of summarization and latency of data availability in their data warehouse. In contrast, HDFS is designed for the storage of massive amounts of granular data, making it ready for just-in-time reporting. It provides high data throughput and scales to hundreds of nodes in a single cluster. In addition, it supports tens of millions of files in a single instance.
According to Gartner, enterprise data will grow 650 percent in the next five years, and 80 percent of it will be unstructured. Unstructured data refers to information that either does not have a data model or has one that is not easily usable by data warehouse applications. Common examples include Word documents, video and audio files, call detail records, clickstream data, log files and email.
With the greater ubiquity of unstructured data, there is the growing realization that operational BI should provide access to both structured and unstructured data as well as an integrated view of both types of data. Still, most traditional technologies that store data for operational BI are optimized solely for structured data stored in relational database tables. One of the distinct advantages of Hadoop over relational database storage is that it’s designed for the storage and processing of both structured and unstructured data because it does not impose a data model on information.
The Hadoop framework has earned its stripes for big data scalability, but its flexibility as an ODS is another area worth exploring when an organization is having trouble keeping up with dynamic business requirements on top of meeting needs for granular data and unstructured information. Using Hadoop, users simply load and store raw data, either structured or unstructured, and translate it on the fly.
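The idea of storing raw data and translating it on the fly can be sketched in a few lines. The sample records, their formats and the read_orders function below are hypothetical, and the example is plain Python rather than a Hadoop API; in a Hadoop job, comparable parsing logic would typically run in the map step. The point is only that structure is applied when the data is read, not when it is stored.

```python
import json

# Hypothetical raw records, stored exactly as they arrived: a JSON event,
# a comma-separated line and a free-text note, all in the same data set.
RAW_RECORDS = [
    '{"type": "order", "customer": "c1", "amount": 25}',
    "c2,order,40",
    "free-text support note: customer c1 asked about delivery",
]


def read_orders(raw_lines):
    """Interpret raw lines at read time, yielding (customer, amount) for orders."""
    for line in raw_lines:
        line = line.strip()
        if line.startswith("{"):
            # Looks like JSON: parse it, and skip the line if it is malformed.
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            if record.get("type") == "order":
                yield record["customer"], int(record["amount"])
        else:
            # Otherwise try the assumed layout: customer,type,amount.
            parts = line.split(",")
            if len(parts) == 3 and parts[1] == "order" and parts[2].isdigit():
                yield parts[0], int(parts[2])


orders = list(read_orders(RAW_RECORDS))
print(orders)
```

If a new record format appears later, only the reading logic changes; the data already stored does not have to be reloaded or remapped.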