Harnessing the value of machine learning in logical data warehouses
Logical data warehouses have been gaining popularity for some time, and the potential benefits of LDWs are many. They help provide a unified infrastructure for querying across disparate data sources. They also help provide security and metadata management on top of multiple analytical data management systems.
The LDW is needed because making data actionable often requires organizations to run multiple analytical systems: data volumes today are too large and too distributed to fit into a single physical system, and analytic needs are too diverse to be served by a single engine.
LDWs help not only traditional analytics but also artificial intelligence (AI) and machine learning (ML) initiatives, by providing agile, governed access to data no matter where it is located or how it is formatted. Besides greatly helping with data discovery, LDWs provide an easy way to expose reusable “logical” data sets to data scientists, who then do not need to deal with complex issues such as combining data from several sources, complex transformations and performance optimization.
Recent studies have found that data scientists spend roughly 80 percent of their time finding, combining and preparing data for analysis, so this is no small feat. Finally, LDW technology can offer an easy way to publish the results of ML efforts for business users and applications.
Three Steps to Establishing a Unified Infrastructure for Analytics
Much like how LDWs can aid AI and ML, machine learning technology itself can also be used to automatically tune LDW performance. This is accomplished by leveraging ML strategies to analyze past queries in the system and then automatically detect bottlenecks while identifying optimization opportunities.
This is a key benefit because the data processed in LDW queries tends to be huge and distributed across many different systems. To improve the quality of service perceived by end users, it’s critical to make the best optimization decisions by following these steps:
1. Preparing the Infrastructure
The first step is to move the processing to the data to avoid latencies associated with transferring large volumes of data at query time. As analyst Rick F. van der Lans points out, data virtualization, when placed at the core of the LDW architecture, is a sophisticated data integration technology that can facilitate this step without physically moving any data. Instead, it accesses the data in its existing location, in real time, as needed.
Additionally, there may be opportunities for big performance gains by selectively moving small datasets from one data source to another. For example, if two datasets are usually combined, and one of them is relatively small, it can make sense to replicate the small one next to the large one in order to speed up queries. ML techniques can be applied to detect these situations and recommend selective data replications that have a low cost and a big impact on overall performance. The LDW can then use these recommendations to perform the replications automatically and keep them up to date.
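The replication idea can be illustrated with a minimal sketch. It is not an ML model and not any product's API: it is a plain frequency heuristic over a hypothetical join log, and the names `Dataset`, `recommend_replications`, `max_size_mb` and `min_joins` are assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str       # logical dataset name (hypothetical)
    source: str     # system where the data physically lives
    size_mb: float  # approximate size, used to decide what counts as "small"


def recommend_replications(datasets, join_log, max_size_mb=500.0, min_joins=3):
    """Suggest copying small datasets next to the larger ones they are joined with.

    join_log is an iterable of (dataset_name, dataset_name) pairs, one per
    join observed in past queries. Returns (dataset, target_source, join_count)
    tuples, most frequently joined first.
    """
    by_name = {d.name: d for d in datasets}
    counts = Counter()
    for left, right in join_log:
        a, b = by_name[left], by_name[right]
        if a.source == b.source:
            continue  # already co-located, nothing to gain
        small, large = (a, b) if a.size_mb <= b.size_mb else (b, a)
        if small.size_mb <= max_size_mb:
            # Candidate: replicate the small side into the large side's system.
            counts[(small.name, large.source)] += 1
    return [
        (name, target, n)
        for (name, target), n in counts.most_common()
        if n >= min_joins
    ]
```

A real system would also weigh how often the small dataset changes, since every replica has to be kept up to date; the thresholds here are arbitrary placeholders.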
2. Putting Machine Learning to Work
The next step is to leverage the natural capacity of ML to analyze past workloads. For example, ML techniques can automatically identify overlaps between groups of analytic queries, which provide opportunities for acceleration, such as pre-computing last year’s sales data aggregated by the most common dimensions (e.g., product, customer, or point-of-sale). Such a pre-computed result can serve as a starting point for many different queries without needing to compute everything from scratch. ML techniques can also identify the optimal system in which to create each intermediate result in order to maximize data co-location at query time.
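As a rough illustration of overlap detection, the sketch below counts how many past queries each observed grouping could serve if it were pre-computed. It uses simple subset counting rather than a learned model, and the function name `suggest_preaggregations`, the `min_coverage` parameter and the query-log shape are assumptions for this example only.

```python
def suggest_preaggregations(query_log, min_coverage=2):
    """Rank candidate pre-aggregations by how many past queries they cover.

    query_log is an iterable of (fact_table, dimensions) pairs, where
    dimensions lists the group-by columns of one past query. An aggregate
    grouped by a set of dimensions can answer any query on the same table
    whose dimensions are a subset of that set.
    Returns (fact_table, dimensions, coverage) tuples, best coverage first.
    """
    queries = [(table, frozenset(dims)) for table, dims in query_log]
    suggestions = []
    for table, dims in set(queries):  # each distinct grouping is a candidate
        coverage = sum(1 for t, d in queries if t == table and d <= dims)
        if coverage >= min_coverage:
            suggestions.append((table, tuple(sorted(dims)), coverage))
    # Highest coverage first; among ties, prefer fewer dimensions,
    # since those aggregates are smaller to store and refresh.
    suggestions.sort(key=lambda s: (-s[2], len(s[1]), s[0], s[1]))
    return suggestions
```

Coverage is only one side of the trade-off: wider groupings cover more queries but produce larger intermediate results, so a fuller version would also estimate storage and refresh cost.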
3. Tuning the LDW
At this point, it is possible to implement a tuning process in which the ML component measures the throughput of the infrastructure, suggests and/or implements changes, measures again, and suggests new changes, in an iterative process. Over time, by continuously measuring, implementing changes, learning, and repeating, such a system will gradually improve performance, much as a tuning expert would.
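The measure, change, measure-again cycle can be sketched as a simple greedy loop. This is an illustrative stand-in for the ML-driven process, not an implementation of it: `tune`, `candidate_changes` and the caller-supplied `measure` function (which is assumed to return a throughput figure where higher is better) are all hypothetical.

```python
def tune(config, candidate_changes, measure, rounds=10):
    """Greedy tuning loop: keep a change only if measured throughput improves.

    config is a dict of settings, candidate_changes a list of dicts that each
    override some settings, and measure a callable returning the throughput
    of a given config. Returns (best_config, history), where history holds
    the best throughput after the initial measurement and after each round.
    """
    best = dict(config)
    best_score = measure(best)
    history = [best_score]
    for _ in range(rounds):
        improved = False
        for change in candidate_changes:
            trial = {**best, **change}
            if trial == best:
                continue  # change is already in effect
            score = measure(trial)
            if score > best_score:
                best, best_score, improved = trial, score, True
        history.append(best_score)
        if not improved:
            break  # no candidate helps any more
    return best, history
```

In practice, measurements are noisy and changes are expensive to apply, so a production loop would average several measurements and apply changes far more cautiously than this sketch does.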
LDWs that are created using data virtualization provide the only viable way to establish a unified infrastructure for analytics across large enterprises. But achieving the optimal configuration for performance can be challenging because of the distributed nature of today’s organizations. By intelligently analyzing previous workloads, ML techniques can automatically suggest (and even implement) optimization actions with the lowest cost and the highest performance gains.