At least 73 percent of businesses have invested or plan to invest in big data within the next 24 months, according to a recent Gartner survey. The big data machine is clearly here to stay, with today’s organizations searching for new ways to derive actionable business value from their troves of information.

While Hadoop-powered data lakes are a popular choice for analyzing data “after the fact,” there is often a missed opportunity in analyzing real-time data streams the moment the information is received, before it enters the enterprise data store. Mike Gualtieri, principal analyst at Forrester Research, refers to the temporal nature of this opportunity as “perishable insight.” To maximize the business value of real-time data, organizations need to look at new approaches to managing data: approaches that combine analytics on streaming data with context (or ‘state,’ in database terminology) and with analytics derived from historical data.

In fact, real-time data analysis, performed as data enters the organization’s data pipeline and before it is passed downstream to long-term analytics engines, is driving net-new business insights and changing how businesses approach data analytics.

Toward a New Enterprise Data Architecture

As companies examine how best to leverage big data, two areas of development effort have come under close scrutiny: how to build new applications that take advantage of streaming data as it enters the organization (the ‘fast data’ end of the pipeline), and how to develop new analytics capabilities (often the big data end of the pipeline). In the process, enterprise architects and CTOs are coming to a conclusion: what have in many cases been two separate functions, developing applications and performing analytics, are beginning to merge, driven by the need to build IT infrastructures that let companies compete by tapping the economic value of streaming data in real time.

Analytics and Fast Data

Analytics, much like ‘real time,’ is a term that appears to mean just about anything. In practice, however, analytics can be divided into two broad buckets:

1. Analytics run on historical data (“batch analytics”) to perform tasks such as building seasonal trends and recommendation matrices. For these use cases, analytics run on the big data side of the pipeline, which has the (complex) technology needed to access large volumes of historical data quickly.

2. Analytics run on streams of data on a per-event basis, e.g., deciding whether a mobile user is about to exceed their account balance and should receive an offer, or alerting when a sensor crosses an alarm threshold. These analytics happen on the fast data side. Interestingly, in the first (big data) case, analytics are used for reporting. In the second (fast data) case, they are used, in real time, to inform and take action: the business puts in-bound streams of data to use to uncover an insight or opportunity and act on it as it appears (see the sketch after this list).
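To make the per-event bucket concrete, here is a minimal sketch in Python. The event fields, subscriber names and thresholds are hypothetical, not any particular vendor’s API; the point is that each incoming event is checked against current state and an action is emitted immediately:

```python
# Per-event (fast data) analytics sketch: each incoming event is evaluated
# against the subscriber's current state, and an action is returned at once.
from dataclasses import dataclass

@dataclass
class CallEvent:
    subscriber_id: str
    requested_minutes: int

# Current state (minutes remaining), normally held in a fast in-memory store
# so the lookup fits inside the per-event latency budget.
minutes_remaining = {"alice": 3, "bob": 120}

def decide(event: CallEvent) -> str:
    """Return an action for this single event, using current state."""
    remaining = minutes_remaining.get(event.subscriber_id, 0)
    if event.requested_minutes > remaining:
        # The subscriber would run out mid-call: trigger an in-call offer.
        return "offer_minutes_bundle"
    return "allow_call"

print(decide(CallEvent("alice", 10)))  # -> offer_minutes_bundle
print(decide(CallEvent("bob", 10)))    # -> allow_call
```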

Another way to say this, in a slightly simpler framework, is that there are basically two types of applications in this space. There are applications against data at rest; these big data applications focus on exploration, analytics and reporting. Then there are applications against data in motion: fast data applications. These data-in-motion applications combine streaming analytics with transactions, typically performed by a platform with the speed to handle real-time, high-velocity input.

Three Case Studies

Let’s look at how three companies have benefited from the ability to mine real-time data, using in-transaction analytics, to drive business value.

One provider of customer marketing software and managed services uses its real-time, per-event decision-making platform to deliver mobile service providers a 253 percent increase in offer purchases, a 50 percent increase in data bundle sales and 157 percent higher conversion rates. The same platform enables the company to help its customers avoid bill shock, using in-transaction analytics to determine whether a subscriber has the credit or minutes needed to place a long-distance call. If the reporting system indicates the subscriber does not have enough minutes left, an in-call offer is made, suggesting the subscriber buy more minutes or upgrade a data/call bundle.

Similarly, a global communication services provider (CSP) uses new data mining technology to obtain up-to-the-second operational visibility into the performance of its systems across its customers’ carrier-grade TV networks, as well as to enable real-time user targeting. The result: fewer outages (and fewer upset viewers), as well as the ability to target ads in real time to specific audience segments. The CSP performs real-time analytics as viewing data pours in, and also blends in data from its longer-term data lake. Instant visibility into which viewer is watching which program, combined with historical trend data pulled from the data lake, enables the company to take immediate action on data in motion while also leveraging data at rest.

Finally, a big data analytics provider for communication service providers mines its data stream in real time, combining fast data and big data to create competitive differentiation and sustained economic value. This company, which serves over half a billion subscribers in 32 countries, processes four billion new data events a day. Leveraging fast data, it delivers real-time triggers, or notifications, that improve operators’ targeted marketing efforts. These real-time triggers, configured based on subscriber actions, notify the operator to take the most appropriate course of action to keep up with and respond to the subscriber. The payoff: subscriber conversion rates improved by up to 300 percent, and average revenue per user (ARPU) increased by 19 percent.

Three Categories of Fast Data

So we see that fast data applications fall into three broad categories: applications focused on real-time analytics, applications that are essentially data pipelines, and applications designed to enable fast request-response. Each of the three has different characteristics.

Real-time analytics applications are often about summarizing the incoming stream, i.e., counting. Data is used to produce a real-time summary of the incoming stream for two different types of consumer. The first is the business user looking for operational monitoring, who asks questions like, “Are all of my API providers sending me data? If I’m collecting data from a smart grid over a large distributed network, are any of my concentrators down?” The second is automated: once you’ve made these data feeds transparent, you can begin to write applications that automatically respond to changes in them, applications that use real-time analytics to act the moment something changes.
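As an illustration, here is a minimal counting sketch, assuming a short tumbling window and hypothetical smart-grid source names (it is not tied to any particular streaming engine): each window yields per-source counts, and any expected source that reported nothing is flagged for the operations dashboard.

```python
# Real-time "counting" analytics sketch: summarize one window of the incoming
# stream and flag expected sources (e.g., smart-grid concentrators) that went
# silent during that window.
from collections import Counter

EXPECTED_SOURCES = {"concentrator-1", "concentrator-2", "concentrator-3"}

def summarize_window(events):
    """events: iterable of (source_id, payload) tuples seen in one window."""
    counts = Counter(source for source, _ in events)
    silent = EXPECTED_SOURCES - set(counts)  # sources with zero events
    return counts, silent

window = [("concentrator-1", "reading"), ("concentrator-1", "reading"),
          ("concentrator-3", "reading")]
counts, silent = summarize_window(window)
print(dict(counts))  # {'concentrator-1': 2, 'concentrator-3': 1}
print(silent)        # {'concentrator-2'} -> alert: this feed appears to be down
```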

In the pipeline case, the data flow is quite different: data flows through the fast data application on its way to other systems. The goal here isn’t so much to summarize the data as to prepare it, most often to be archived into an OLAP system for historical analysis.

For example, you might want to track marathon runners as they run a race. Put a sensor in a bib or in a shoe, and as the runner approaches a milestone on the course, the sensor reports that the runner is at that milestone. The application could then tell spectators the runner is nearby, or it could calculate the runner’s overall position by cohort, e.g., by age or by gender. From the pipeline perspective, though, the important work is filtering all of the redundant proximity readings down to a single reading, because the downstream applications only need that one event.
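A minimal deduplication sketch of that filtering step might look like the following; the field names are illustrative, not taken from any real race-timing system.

```python
# Pipeline dedup sketch: a sensor may ping the same milestone many times as a
# runner passes it, but downstream consumers (e.g., the OLAP archive) only
# need one event per (runner, milestone).
def dedupe(pings):
    """Yield only the first ping for each (runner_id, milestone) pair."""
    seen = set()
    for ping in pings:
        key = (ping["runner_id"], ping["milestone"])
        if key not in seen:
            seen.add(key)
            yield ping  # forward a single event downstream

raw = [
    {"runner_id": 42, "milestone": "10k", "ts": 1},
    {"runner_id": 42, "milestone": "10k", "ts": 2},  # redundant reading
    {"runner_id": 42, "milestone": "10k", "ts": 3},  # redundant reading
    {"runner_id": 7,  "milestone": "10k", "ts": 4},
]
print(list(dedupe(raw)))  # two events survive, one per runner per milestone
```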

Then there are request-response applications: fast data applications that often exceed the capability of legacy systems in terms of the frequency of requests and responses. They typically involve personalization, recommendations or authorizations. One use is to limit or enforce a policy governing access to some resource, which boils down to authorization. Another is to serve someone a response or piece of information customized on the basis of that customer’s historical value to your company.
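A hypothetical request-response sketch follows, assuming a simple rate-limit policy and a per-customer value tier that a batch (big data) job computed ahead of time; the identifiers and limits are made up for illustration.

```python
# Request-response sketch: authorize the request against a policy, then
# personalize the response using a tier precomputed by batch analytics.
value_tier = {"cust-1": "gold", "cust-2": "standard"}   # from the big data side
requests_this_minute = {"cust-1": 3, "cust-2": 250}     # fast, per-event state
RATE_LIMIT = 100

def handle_request(customer_id: str) -> dict:
    if requests_this_minute.get(customer_id, 0) > RATE_LIMIT:
        return {"status": 429, "body": "rate limit exceeded"}   # authorization
    tier = value_tier.get(customer_id, "standard")
    body = "premium recommendations" if tier == "gold" else "default offer"
    return {"status": 200, "body": body}                        # personalization

print(handle_request("cust-1"))  # personalized response for a high-value customer
print(handle_request("cust-2"))  # request denied by policy
```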

Bottom Line

When you’re thinking about building fast data applications at scale, it’s important to think differently about fast and big, because they have very different requirements. As you work through your challenges, understand that implementing these applications means accounting not just for processing the incoming data feed, but also for adding context and awareness, which requires real-time transactions, as well as for using the analytic results derived from your data warehouse as you make real-time decisions. In addition, your system must be capable of producing low-latency responses, measured in single-digit milliseconds, to capture “perishable insights.”

John Hugg is founding software engineer at VoltDB.
