Big Data, once a buzzword, is now commonplace across all industries, as companies want to leverage the tremendous volume and variety of data available – much of it now in real time – to predict customer churn, improve customer service or better understand their own performance. But these projects are easier said than done.

Many businesses, including a large number of our customers, are making significant investments in technologies like Apache Hadoop and Apache Spark, but their ability to gain value and insight depends on getting all of their enterprise data into one central data hub/data lake. While this is no simple task, the insights gained from being able to analyze all enterprise data – no matter the source – are well worth the effort.

At a macro level, organizations have never had more tools available to integrate their enterprise data for meaningful insights, especially with powerful compute frameworks like Hadoop and Spark. However, accessing, integrating and managing data from all enterprise sources – legacy and new – still proves to be a challenge.

For example, accessing data from legacy mainframe systems is particularly complicated – with regard to connectivity, data and file types, security, compliance and overall expertise. Yet with 70 percent of corporate data still stored on mainframe computers, organizations need a way to leverage that data for Big Data analytics.

If that critical customer/transaction data isn’t part of the data lake, a significant piece of the puzzle is missing. This data is crucial as historical reference data, whether it is used for fraud detection, for predictive analytics to prevent security attacks or for real-time insight into who accessed which data and when. By liberating this data from the mainframe, companies can make better, more informed decisions with information they might never have had the opportunity to explore before – significantly impacting their growth and profitability.

Newer data sources deliver even more potential – and complexity. Streaming telemetry data, sensor data and Internet of Things (IoT) use cases require additional components in the technology stack. Supporting connectivity to these data types is an obvious requirement, but the real value lies in the convergence of these streaming data sources with batch historical data.

Combining real-time and batch data for bigger insights and greater business agility requires a single software environment. Because of this, we will see more focus on simplifying and unifying the user experience so organizations have a single hub/platform for accessing all enterprise data – in real time.
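
As one concrete illustration of such an environment, consider a minimal PySpark Structured Streaming sketch that enriches a live event stream with batch historical data inside a single Spark session. This is a hedged example only; the Kafka broker, topic, file paths and column names are hypothetical placeholders rather than a reference to any specific product.

```python
# Hypothetical sketch: joining streaming events with batch reference data
# in one Spark session (requires the spark-sql-kafka connector package).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-batch-join").getOrCreate()

# Batch side: historical/reference data already landed in the data lake.
# Assumed to contain a customer_id column (hypothetical path and schema).
history = spark.read.parquet("/datalake/customer_history")

# Streaming side: real-time events arriving from a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "transactions")               # hypothetical topic
          .load()
          .selectExpr("CAST(key AS STRING) AS customer_id",
                      "CAST(value AS STRING) AS payload"))

# Stream-static join: each incoming event is enriched with its history.
enriched = events.join(history, on="customer_id", how="left")

# Continuously write the combined view back to the lake (placeholder paths).
query = (enriched.writeStream
         .format("parquet")
         .option("path", "/datalake/enriched_events")
         .option("checkpointLocation", "/datalake/_checkpoints/enriched_events")
         .outputMode("append")
         .start())
query.awaitTermination()
```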

So what exactly does “real-time” mean?

To some users it means every hour; to others it means sub-second. To me, “real-time” is simply a matter of reacting before it’s too late – before you lose a customer or before there is a data breach. The ability to react quickly comes down to being able to access both historical/reference data from batch sources and streaming data for instant analysis. This is where the data hub comes into the picture – you need all enterprise data available at all times.

Evolving technologies are helping to make ‘real-time’ a reality, but this is a difficult process for most organizations. With disparate business units and groups involved, cooperation across departments is critical. When they are able to work together, one of the most beneficial use cases is operational intelligence, with operational data actually being processed in Big Data analytics platforms like Apache Spark.

Emerging platforms such as Hadoop and Spark not only enable faster advanced analytics, they also lower the cost of those analytics. This is especially true compared with analyzing security and telemetry data directly on the mainframe, which can be quite expensive.

That cost advantage has been an important driver for the initial operational efficiency use cases many organizations started with. As they transition to more transformative use cases, they face greater challenges from the rapidly evolving technology stack and the multiple compute platforms it spans.

The bigger advantage of Spark lies in its promise as a single compute platform for a variety of workloads: streaming and batch, predictive analytics, interactive queries and more. The intelligence an organization is able to collect with real-time analysis is seemingly endless, but there are a few best practices for getting the most value from an enterprise data hub.
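
To make that promise concrete, here is a hedged PySpark sketch in which one Spark session serves a batch load, an interactive SQL query and a simple predictive model. The paths, table name and columns (amount, txn_count_30d, is_fraud) are hypothetical.

```python
# Hypothetical sketch: several workload types on a single Spark engine.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-workloads").getOrCreate()

# Batch workload: load and clean historical records (hypothetical path).
df = spark.read.parquet("/datalake/transactions").dropna()

# Interactive query workload: ad hoc SQL over the same data.
df.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT txn_date, COUNT(*) AS txn_count, SUM(amount) AS total
    FROM transactions
    GROUP BY txn_date
""").show()

# Predictive workload: train a simple fraud-style classifier with MLlib.
features = VectorAssembler(
    inputCols=["amount", "txn_count_30d"],  # hypothetical feature columns
    outputCol="features").transform(df)
model = LogisticRegression(labelCol="is_fraud",
                           featuresCol="features").fit(features)
```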

Prioritizing ROI, governance and security, and “future-proofing” are all keys to increasing the value of data integration. To ensure solid ROI, organizations must include all critical sources of enterprise data, including those that have traditionally been handled in a silo, such as the mainframe; process that data in a cost-effective and scalable environment, such as Apache Spark; and free up their staff from coding and tuning so they can tackle higher-value projects.

Of equal importance, any data integration environment needs to ensure the security of the data and provide data lineage for compliance requirements. This is particularly true in highly regulated industries such as financial services, banking, insurance and healthcare.

Lastly, to maximize the value of data integration investments, companies need to take an approach that allows them to keep up with the latest innovations but insulates them from the disruption and cost of rewriting all their jobs and acquiring new skills every time a new compute framework comes out. Whether the underlying compute framework is Hadoop MapReduce, Spark or Flink should be irrelevant to the user who is working with a high-level data pipeline.
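
One way to achieve that kind of insulation, offered purely as an illustration rather than as the approach described here, is a portability layer such as Apache Beam: the same high-level pipeline definition can be handed to different compute frameworks by changing only the runner. The file paths below are hypothetical.

```python
# Hypothetical sketch: one pipeline definition, multiple runners.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def build_pipeline(options):
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("/datalake/raw/events.txt")
         | "Parse" >> beam.Map(lambda line: line.split(","))
         | "KeyByCustomer" >> beam.Map(lambda fields: (fields[0], 1))
         | "CountPerCustomer" >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda customer, n: f"{customer},{n}")
         | "Write" >> beam.io.WriteToText("/datalake/out/counts"))

# Local engine for development; only the runner choice changes for other engines.
build_pipeline(PipelineOptions(["--runner=DirectRunner"]))
# build_pipeline(PipelineOptions(["--runner=SparkRunner"]))  # plus Spark config
# build_pipeline(PipelineOptions(["--runner=FlinkRunner"]))  # plus Flink config
```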

We’re already seeing more convergence between batch analytics and operational analytics – driving the adoption of a single platform for streaming data sources and batch data sources. With the increase in the number of connected devices and IoT use cases, we only expect this to grow. And if organizations can overcome the many challenges associated with this integration, the benefits will be endless.

(About the author: Tendu Yogurtcu is General Manager of Big Data at Syncsort)
