Overheard: Embedded Analytics and Next Gen Data Warehousing
Why are we hearing about embedded analytics and what are we talking about?
What we’re often talking about is analytics that are embedded in applications or embedded in business processes or in the decision cycle of the company. Another way of looking at it is where analytics are embedded or executing in some kind of platform where analytics aren’t traditionally executed, such as in the database or in the data warehouse.
Why are we interested in embedded analytics?
A data warehouse is fundamentally a very specialized database, and a database traditionally is all about keeping tables and records and indexes to support queries and updates, but it is not a place where you do a lot of analytics and processing. But if you embed analytics in that database, it becomes a computing platform sort of like an application server where we use stored procedures and user-defined functions. Now there are new ways to embed analytics to execute on databases, data warehouses and new approaches like MapReduce and Hadoop. In the new world on the data warehouse side, embedded analytics refers to an open framework that is different from the way we did stored procedures and user-defined functions on traditional proprietary platforms.
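To make the idea of user-defined functions running inside the database concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a warehouse engine; the table `sales`, the function `is_outlier` and the aggregate `py_median` are hypothetical names invented for illustration, not anything from the platforms discussed in the interview.

```python
import sqlite3
import statistics


class Median:
    """Aggregate that the SQL engine feeds row by row while it runs the query."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per input row; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        return statistics.median(self.values) if self.values else None


def is_outlier(amount, threshold):
    """Scalar function: 1 if the amount exceeds the threshold, else 0."""
    return 1 if amount > threshold else 0


conn = sqlite3.connect(":memory:")
# Register the analytics so that SQL statements can call them by name.
conn.create_function("is_outlier", 2, is_outlier)
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("east", 500.0),
     ("west", 20.0), ("west", 40.0)],
)

# The analytic logic executes as part of the query, next to the data.
rows = conn.execute(
    "SELECT region, py_median(amount), SUM(is_outlier(amount, 100.0)) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```

The query returns one row per region with the median amount and the count of outliers, so the calling application only ever sees the finished result.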
Why are we moving in that direction?
For one thing, the data warehouse is becoming the biggest, baddest repository pool of computing power, memory, I/O, bandwidth and storage in your company. It’s a powerful platform where more and more data will come to stay, hundreds of terabytes and petabytes that are hard or wasteful to slosh around between application platforms. Rather than move all that data, we can think about moving the applications to the data. In that way the data warehouse is almost a gravitational force with satellite applications executing natively there. As the data sets get larger and as the applications get more demanding in terms of compute power, it just makes sense to move more of this logic to this new generation of big data warehouse.
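The trade-off between moving the data and moving the logic can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, and the `events` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (device TEXT, bytes_sent INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f"device-{i % 4}", i) for i in range(10_000)],
)

# Option A: move the data to the application, then compute there.
all_rows = conn.execute("SELECT device, bytes_sent FROM events").fetchall()
totals_client = {}
for device, n in all_rows:
    totals_client[device] = totals_client.get(device, 0) + n

# Option B: move the logic to the data and return only the summary.
summary_rows = conn.execute(
    "SELECT device, SUM(bytes_sent) FROM events GROUP BY device"
).fetchall()
totals_db = dict(summary_rows)

# Same answer either way, but very different row counts cross the boundary.
print(len(all_rows), "rows moved vs", len(summary_rows), "rows moved")
```

Both options produce the same totals. The difference is that the first ships every row out of the database while the second ships one row per device, which is the gap that widens as data sets grow toward terabytes and petabytes.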
We used to look at data warehouses as a place to put historical data that didn’t have the same currency as the operational data that’s closer to real time.
The data warehouse has evolved to be architected for real-time applications and real-time ingest and query. Teradata and other vendors have gone deep in this direction. So we’re now seeing the data warehouse with both historic and real-time data and batch processes and low latency processes coexisting in a common computing and storage fabric called the data warehouse grid or cluster or cloud.
We’ve tried to move away from static, monolithic data sets to in-memory or on the fly processing, so where do the two meet?
Well, this is true. Because in-memory is much faster than disk, if you want true real-time apps end-to-end, you need a completely distributed in-memory caching environment and a heck of a lot of I/O bandwidth. The real-time fabric of modern life will require a push to all in-memory data architectures in data warehouses everywhere, but also into clients. As memory gets ever cheaper and we’re able to persist more info in solid state disk and flash on devices, users are going to want many things. One of those is the ability to bring a lot of information into their own device for exploration and visualizations and what-if analyses on the fly. They’ll eventually be able to work with multiple gigabytes and potentially terabytes of information right in their hands. At some point you’re going to want a complete personal data mart in your iPhone so you don’t need to do the round trip back to the data warehouse or server-side data mart to grab all this information. We’ll see more in-memory BI architectures that are very mobile-oriented.
But will the monolithic or virtual data warehouse continue to grow?
The majority of data warehouses in the world now are between one and 10 terabytes total. As memory gets cheaper and you can begin to have terabyte scale memory on a handheld client, all that traditional information will fit in anyone’s pocket. And the data warehouse on the server side where the master records are kept will grow ever larger, into petabytes. The caches that you hold locally will be synchronized for your use, and your user device will personalize the displays, the calculations and the functions on the cache to meet your specific needs. And they’ll auto-synchronize with the server side according to [user access] policy. It will be real-time decisioning in the field without that round trip. There will be bandwidth constraints with wireless, so you won’t sync all the data sets in a split second, but it will be more and more critical to have a deep and fairly current cache of the latest information in your hands.
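A rough sketch of a policy-driven local cache might look like the following. Everything here is an assumption made for illustration: the `SyncPolicy` and `LocalCache` classes, the staleness rule and the dictionary standing in for the server are hypothetical, and a real client would also have to deal with bandwidth limits, partial syncs and conflicts.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPolicy:
    """Hypothetical user access policy: what may be cached and how stale it may get."""
    max_age_seconds: float
    allowed_tables: frozenset


@dataclass
class LocalCache:
    policy: SyncPolicy
    data: dict = field(default_factory=dict)
    synced_at: dict = field(default_factory=dict)

    def needs_sync(self, table, now):
        # Tables outside the policy are never pulled down.
        if table not in self.policy.allowed_tables:
            return False
        last = self.synced_at.get(table)
        return last is None or now - last > self.policy.max_age_seconds

    def sync(self, server, now):
        """Refresh only the permitted tables that are missing or stale."""
        refreshed = []
        for table in sorted(self.policy.allowed_tables):
            if self.needs_sync(table, now) and table in server:
                self.data[table] = list(server[table])  # copy a snapshot
                self.synced_at[table] = now
                refreshed.append(table)
        return refreshed

    def query(self, table):
        # Served from local memory, with no round trip to the server.
        return self.data.get(table, [])


# A plain dict stands in for the server-side warehouse.
server = {"sales": [1, 2, 3], "payroll": [9]}
cache = LocalCache(SyncPolicy(max_age_seconds=300, allowed_tables=frozenset({"sales"})))

first = cache.sync(server, now=0.0)      # initial load of permitted tables
server["sales"].append(4)                # master data changes on the server
second = cache.sync(server, now=60.0)    # still fresh, nothing is transferred
stale_view = cache.query("sales")        # the local copy lags the server
third = cache.sync(server, now=400.0)    # past max age, so it refreshes
fresh_view = cache.query("sales")
```

Between syncs the device answers queries from its own copy, which is why the cache can lag the master records for a bounded time without blocking decisions in the field.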
So a lot of data will persist on the server side and users will access decision-making information like a sponge?
Exactly. It will be important to manage the transmission and the storage requirements on the client side to control the replication and synchronization. But the bottom line is that the user doesn’t have to incur that latency on the run. They can put massive marts in their pocket and query them at memory speed to support the transactions and decisions they are making now. Real time will enable ever more agile and data-rich exploration and calculation in the field for everybody.