Big Data Changed the Way We Think About Data Warehousing
Organizations’ expectations have changed the demand for data.
The change requires organizations to gather and manage very large data sets (terabytes and petabytes of data) for processing, near real-time analysis and predictive analytics, and it is due, in part, to a society that is increasingly fast-paced. This big data challenge affects data warehouses and the way we think about them. Day-old data was once considered fresh enough for analysis. However, Internet traffic trends can vary from hour to hour or even minute to minute. For example, online advertisers may miss the bulk of their customers if they rely on day-old Web traffic trends when deciding how to market to them. Something that goes viral during the day may be relevant for only a couple of hours or days and can become stale very quickly.
New technologies try to keep pace with the ever-increasing demands. These technologies create new opportunities and markets for data analysis. Vendors have created hardware and software to help support organizations’ initiatives for big data. We need to consider how these new technologies fit in our data warehouse environments, because trying to redesign the data warehouse after it is created can be costly.
Organizations can gain a competitive advantage if the decisions they make are based on warehouse data that is available, relevant and up-to-date. However, data warehouses have changed since organizations first implemented them, and so has the way we think about storing larger data sets. Traditional data warehouses have a reputation for being expensive, large and slow to refresh; some are still refreshed only once per day. Day-old data is now considered stale by some organizations. To discover trends as they occur, organizations need data to be refreshed more frequently and at shorter intervals. Data warehouses must become more agile to support this real-time analysis.
A New Mindset
This new mindset has forced data warehouse architects to rethink what used to be best practices. Questions include: How should we load the data warehouse: ETL versus ELT or mixed TETL? How should we store the data: primary keys versus distributed keys? SQL, NoSQL or "not only SQL"? Structured data or unstructured data? Conceptual models or traditional data models? Columnar databases or relational paradigms? What approach should be taken with data warehousing when it comes to big data?
The traditional data warehouse methods are being reconsidered: dimensional modeling, with transaction data stored as facts and dimensions, and normalized modeling, using normalization methods. Dimensional models have been used for better performance on query operations, but they require complicated, time-consuming data loads to maintain data integrity. It can also be difficult to modify and maintain the data warehouse once it is implemented and the business decides to change. Normalized models, on the other hand, have simpler data loads and are easier to maintain for data integrity. However, their performance is hindered by the need to join multiple tables when the data is queried. Loading the data warehouse has also had to become quicker. Many of the traditional ETL processes that loaded the data warehouse have become ELT processes, which allow for quicker load times because the data is transferred first and then transformed locally in the target system.
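The ELT idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a recommended design: it uses Python's built-in sqlite3 module as a stand-in for a warehouse, and the table names (staging_events, hourly_clicks), the function name and the sample rows are all hypothetical. The raw rows are loaded unchanged, and the transformation (type casting and hourly aggregation) runs as SQL inside the database afterward.

```python
import sqlite3

# Hypothetical raw click events as they might arrive from a source system.
# Every value is still text; nothing has been cleaned or aggregated yet.
RAW_EVENTS = [
    ("2024-01-01T10:05:00", "home", "3"),
    ("2024-01-01T10:40:00", "home", "2"),
    ("2024-01-01T11:15:00", "pricing", "5"),
]


def load_then_transform(conn, rows):
    """ELT: copy rows unchanged into staging, then transform with SQL in the database."""
    # Load step: no transformation happens before the data lands in the target.
    conn.execute("CREATE TABLE staging_events (ts TEXT, page TEXT, clicks TEXT)")
    conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?)", rows)

    # Transform step: casting and hourly aggregation run inside the database.
    conn.execute(
        """
        CREATE TABLE hourly_clicks AS
        SELECT substr(ts, 1, 13) AS hour,
               page,
               SUM(CAST(clicks AS INTEGER)) AS clicks
        FROM staging_events
        GROUP BY substr(ts, 1, 13), page
        """
    )
    return conn.execute(
        "SELECT hour, page, clicks FROM hourly_clicks ORDER BY hour, page"
    ).fetchall()


conn = sqlite3.connect(":memory:")
result = load_then_transform(conn, RAW_EVENTS)
for row in result:
    print(row)
```

In an ETL version of the same job, the casting and aggregation would run in the loading program before any row is inserted, so the load could not finish until the transformation had.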
Architects need to reconsider traditional approaches to data warehousing. Ralph Kimball and Bill Inmon offer two different philosophical approaches to data warehousing. Kimball’s approach puts data marts together to create the data warehouse, and Inmon’s approach creates data marts from the data warehouse. However, which one is better for big data?
Some believe that Kimball’s approach may be better for big data analysis because it breaks up the data into multiple data marts that serve the needs of multiple business units. Inmon’s approach instead creates individual data marts from extracts of one large data warehouse, which may be time-consuming because of the volume of data transferred. However, Kimball’s approach may require more storage because of the overlap of data between data marts. So instead of thinking only in terms of a bottom-up versus top-down approach, architects have to think about questions such as: How much data needs to be stored? How fast can the data be retrieved for analysis? How long will the data be stored, and how much time will it take to analyze it?
Trade-offs are needed, and the best solution for the new generation of data warehouses may depend more on organizational needs than on the technology or data warehousing practices. There is also the possibility that data warehouses may not be the best solution to the big data challenge. Architects need to keep open minds as the next generation of data warehouses progresses.