Last year the big data market centered squarely on technology around the Hadoop ecosystem. Since then, it’s been all about ‘putting big data to work’ through use cases shown to generate ROI through increased revenue, higher productivity and lower risk.
Now, big data continues its march beyond the early adopters. Next year we can expect to see more mainstream companies adopting big data and IoT, with traditionally conservative and skeptical organizations starting to take the plunge.
Data blending will matter more than it did a few years ago, when we were just getting started with Hadoop. Combining social data, mobile apps, CRM records and purchase histories through advanced analytics platforms allows marketers a glimpse into the future by bringing hidden patterns and valuable insights on current and future buying behaviors to light.
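As a toy illustration of what data blending means in practice (all field names and values below are invented for the example), the sketch joins CRM records with purchase histories by customer ID in plain Python:

```python
# Toy data-blending sketch: joining hypothetical CRM records with
# purchase histories by customer ID (all names/values are invented).

crm = [
    {"customer_id": 1, "name": "Ada", "segment": "enterprise"},
    {"customer_id": 2, "name": "Ben", "segment": "smb"},
]
purchases = [
    {"customer_id": 1, "item": "license", "amount": 1200.0},
    {"customer_id": 2, "item": "support", "amount": 300.0},
    {"customer_id": 1, "item": "training", "amount": 450.0},
]

def blend(crm_rows, purchase_rows):
    """Left-join purchases onto CRM records and total spend per customer."""
    by_id = {row["customer_id"]: dict(row, total_spend=0.0) for row in crm_rows}
    for p in purchase_rows:
        if p["customer_id"] in by_id:
            by_id[p["customer_id"]]["total_spend"] += p["amount"]
    return by_id

blended = blend(crm, purchases)
print(blended[1]["total_spend"])  # Ada's combined spend across both sources
```

Real blending pipelines join many more sources (clickstreams, social feeds) and run inside an analytics platform, but the core operation is the same: keying disparate records to a shared identity.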
The spread of self-service data analytics, along with widespread adoption of the cloud and Hadoop, is creating industry-wide change that businesses will either take advantage of or ignore at their peril. The reality is that the tools are still emerging, and the promise of the (Hadoop) platform is not yet at the level it needs to be for businesses to rely on it.
As we move forward, five key trends will be shaping the world of big data:
The Internet of Things (IoT)
Businesses are increasingly looking to derive value from all data; large industrial companies that make, move, sell and support physical things are plugging sensors attached to their ‘things’ into the Internet. Organizations will have to adapt their technologies to ingest and work with IoT data. This presents countless new challenges and opportunities in the areas of data governance, standards, health and safety, security and supply chain, to name a few.
IoT and big data are two sides of the same coin; billions of internet-connected 'things' will generate massive amounts of data. However, that in itself won't usher in another industrial revolution, transform day-to-day digital living, or deliver a planet-saving early warning system. Data from outside the device is the way enterprises can differentiate themselves. Capturing and analyzing this type of data in context can unlock new possibilities for businesses.
Research has indicated that predictive maintenance can generate savings of up to 12 percent over scheduled repairs, leading to a 30 percent reduction in maintenance costs and a 70 percent cut in downtime from equipment breakdowns. For a manufacturing plant or a transport company, achieving these results from data-driven decisions can add up to significant operational improvements and savings opportunities.
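The mechanism behind predictive maintenance can be sketched in a few lines; the readings, window size and threshold below are invented for illustration, and real systems use far richer models than a rolling average:

```python
# Minimal predictive-maintenance sketch: flag a machine for service when
# the rolling mean of its vibration readings drifts past a threshold,
# instead of waiting for a fixed service schedule.
# Readings, window, and threshold are invented for illustration.

def rolling_mean(values, window):
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def first_alert(readings, window=3, threshold=0.8):
    """Return the index of the first reading whose trailing window mean
    breaches the threshold, or None if the equipment looks healthy."""
    for i, mean in enumerate(rolling_mean(readings, window)):
        if mean > threshold:
            return i + window - 1  # index of the last reading in the window
    return None

vibration = [0.2, 0.3, 0.2, 0.4, 0.9, 1.3, 1.2]  # drifting upward
print(first_alert(vibration))
```

The savings figures cited above come from replacing calendar-driven repairs with exactly this kind of data-driven trigger, applied across a whole fleet of equipment.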
Deep Learning
Deep learning, a set of machine-learning techniques based on neural networking, is still evolving, but shows great potential for solving business problems. It enables computers to recognize items of interest in large quantities of unstructured and binary data, and to deduce relationships without needing specific models or programming instructions.
These algorithms are largely motivated by the field of artificial intelligence, which has the general goal of emulating the human brain’s ability to observe, analyze, learn, and make decisions, especially for extremely complex problems. A key concept underlying deep learning methods is distributed representations of the data, in which a large number of possible configurations of the abstract features of the input data are feasible, allowing for a compact representation of each sample and leading to a richer generalization.
Deep learning is primarily useful for learning from large amounts of unlabeled/unsupervised data, making it attractive for extracting meaningful representations and patterns from Big Data. For example, it could be used to recognize many different kinds of data, such as the shapes, colors and objects in a video — or even the presence of a cat within images, as a neural network built by Google famously did in 2012.
As a result, the enterprise will likely see more attention placed on semi-supervised or unsupervised training algorithms to handle the large influx of data.
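The core idea of learning representations from unlabeled data can be shown with a toy autoencoder in NumPy; the network sizes, learning rate and synthetic data below are arbitrary and bear no resemblance to production-scale deep learning:

```python
import numpy as np

# Toy unsupervised-learning sketch: a tiny tanh autoencoder compresses
# unlabeled data into a smaller hidden representation and learns by
# minimizing reconstruction error. Sizes and rates are arbitrary.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))             # unlabeled data: 200 samples, 8 features

W1 = rng.normal(scale=0.1, size=(8, 3))   # encoder: 8 -> 3 (compressed code)
W2 = rng.normal(scale=0.1, size=(3, 8))   # decoder: 3 -> 8

def loss(X, W1, W2):
    H = np.tanh(X @ W1)                   # hidden (distributed) representation
    R = H @ W2                            # reconstruction of the input
    return ((R - X) ** 2).mean()

initial = loss(X, W1, W2)
lr = 0.05
for _ in range(300):                      # plain full-batch gradient descent
    H = np.tanh(X @ W1)
    R = H @ W2
    G = 2 * (R - X) / R.size              # dLoss/dR
    W2 -= lr * H.T @ G
    W1 -= lr * X.T @ ((G @ W2.T) * (1 - H ** 2))
print(initial, loss(X, W1, W2))           # reconstruction error should drop
```

No labels are used anywhere: the network discovers a compact code for the data on its own, which is the property that makes these methods attractive for unsupervised big data.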
In-Memory Analytics
Unlike conventional business intelligence (BI) software that runs queries against data stored on server hard drives, in-memory technology queries information loaded into RAM, which can significantly accelerate analytical performance by reducing or even eliminating disk I/O bottlenecks. With big data, it is the availability of terabyte systems and massive parallel processing that makes in-memory more interesting.
At this stage of the game, big data analytics is really about discovery. Each iterative query to find correlations between data points carries milliseconds of latency, and that latency is multiplied across millions or billions of iterations. Working in memory can be three orders of magnitude faster than going to disk.
In 2014, Gartner coined the term HTAP (Hybrid Transaction/Analytical Processing) to describe a new technology that allows transactions and analytic processing to reside in the same in-memory database. It allows application leaders to innovate via greater situation awareness and improved business agility; however, it entails an upheaval in established architectures, technologies and skills, driven by the use of in-memory computing technologies as enablers.
Many businesses are already leveraging HTAP; for example, retailers can quickly identify items trending as bestsellers within the past hour and immediately create customized offers for those items.
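The flavor of that retail scenario can be sketched with SQLite’s in-memory mode standing in for a real HTAP engine (which it is not): transactional inserts and an analytical “trending in the last hour” query run against the same in-memory store. The schema and data are invented for the example:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# HTAP-flavored sketch: transactional writes and an analytical query on
# the same in-memory store. SQLite stands in for a real HTAP engine
# purely for illustration; the schema and data are invented.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, qty INTEGER, sold_at TEXT)")

now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("widget", 5, now - timedelta(minutes=10)),
    ("widget", 7, now - timedelta(minutes=30)),
    ("gadget", 2, now - timedelta(minutes=45)),
    ("gadget", 1, now - timedelta(hours=3)),   # outside the one-hour window
]
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [(i, q, t.isoformat()) for i, q, t in rows])

# Analytical side: bestsellers within the past hour, freshest data included.
cutoff = (now - timedelta(hours=1)).isoformat()
trending = db.execute(
    "SELECT item, SUM(qty) AS sold FROM sales "
    "WHERE sold_at >= ? GROUP BY item ORDER BY sold DESC",
    (cutoff,),
).fetchall()
print(trending)
```

The point is architectural: the analytic query sees the transactions the moment they land, with no ETL hop to a separate warehouse.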
But there’s a lot of hype around HTAP, and businesses have been overusing it. For systems where the user needs to see the same data in the same way many times during the day, and there’s no significant change in the data, in-memory is a waste of money. And while you can perform analytics faster with HTAP, all of the transactions must reside within the same database. The problem is that most analytics efforts today involve combining transactions from many different systems.
It’s all on Cloud
Hybrid and public cloud services continue to rise in popularity, with investors claiming their stakes. The key to big data success is in running the (Hadoop) platform on an elastic infrastructure.
We will see the convergence of data storage and analytics, resulting in new, smarter storage systems optimized for storing, managing and sorting massive, petabyte-scale data sets. Going forward, we can expect to see the cloud-based big data ecosystem continue its momentum in the overall market at more than just the “early adopter” margin.
Companies want a platform that allows them to scale, something that cannot be delivered through a heavy investment in a data center that is frozen in time. For example, the Human Genome Project started as a gigabyte-scale project but quickly got into terabyte and petabyte scale. Some of the leading enterprises have already begun to split workloads in a bi-modal fashion and run some data workloads in the cloud. Many expect this to accelerate strongly as these solutions move further along the adoption cycle.
There is a big emphasis on APIs to unlock data and capabilities in a reusable way, with many companies looking to run their APIs in the cloud and in the data center. On-premises APIs offer a seamless way to unlock legacy systems and connect them with cloud applications, which is crucial for businesses that want to make a cloud-first strategy a reality.
More businesses will run their APIs in the cloud, providing elasticity to better cope with spikes in demand and make efficient connections, enabling them to adopt and innovate faster than the competition.
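A minimal, standard-library-only sketch of exposing data through an HTTP API is below; the endpoint name and payload are invented, and a real deployment would of course sit behind an API gateway with authentication, rate limiting and elastic scaling:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch of unlocking data behind an HTTP API (stdlib only).
# Endpoint and payload are invented; real deployments add gateways,
# auth, and elastic scaling in the cloud.

LEGACY_RECORDS = {"orders": [{"id": 1, "status": "shipped"}]}

class ApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/orders":
            body = json.dumps(LEGACY_RECORDS["orders"]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):   # keep the sketch quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ApiHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/orders"
with urllib.request.urlopen(url) as resp:
    orders = json.loads(resp.read())
server.shutdown()
print(orders)
```

Whether this handler runs on-premises next to a legacy system or on an elastic cloud instance, consumers see the same contract — which is what makes APIs the reusable seam between the two worlds.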
Apache Spark
Apache Spark is lighting up big data. The popular Apache Spark project provides Spark Streaming to handle processing in near real time through a mostly in-memory, micro-batching approach. It has moved from being a component of the Hadoop ecosystem to the big data platform of choice for a number of enterprises.
Now the largest big data open source project, Spark provides dramatically faster data processing than Hadoop MapReduce and, as a result, is much more natural, mathematical, and convenient for programmers. It provides an efficient, general-purpose framework for parallel execution.
Spark Streaming, a core component of Spark, streams large volumes of data by breaking them into smaller micro-batches and then transforming each one, thereby accelerating the creation of RDDs (resilient distributed datasets). This is very useful in today’s world, where data analysis often requires the resources of a fleet of machines working together.
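The micro-batch model can be illustrated without Spark itself; the sketch below is plain Python standing in as an analogy, not the Spark API, with an invented stream and transformation:

```python
# Plain-Python sketch of the micro-batch model Spark Streaming uses:
# chop a stream into small batches, transform each batch as a unit,
# and accumulate results. This is an analogy, not the Spark API.

def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from an iterable."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # flush the final partial batch
        yield batch

def process(batch):
    """Per-batch transformation (here: squaring each reading)."""
    return [x * x for x in batch]

stream = range(1, 8)                # stand-in for an unbounded event stream
results = []
for batch in micro_batches(stream, batch_size=3):
    results.extend(process(batch))
print(results)                      # squares, computed batch by batch
```

In real Spark, each micro-batch becomes an RDD that is transformed in parallel across the cluster; the batching itself is what lets a streaming workload reuse the batch engine.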
However, it’s important to note that Spark is meant to enhance, not replace, the Hadoop stack. To gain even greater value from big data, companies should consider using Hadoop and Spark together for better analytics and storage capabilities.
Increasingly sophisticated big data demands mean the pressure to innovate will remain high. If they haven’t already, businesses will begin to see that customer success is a data job. Companies that are not capitalizing on data analytics will start to go out of business, with successful enterprises realizing that the key to growth is data refinement and predictive analytics.
(About the author: Bharadwaj “Brad” Chivukula is senior technical manager at Nisum)