The existing approaches to big data -- focused upon log analysis and batch processing of business data with technologies like Hadoop -- are woefully inadequate for processing streams of metrics, streams of relationships, and streams of state in real time.
‘Process’ in these cases means the ability to ingest these streams as they arrive, continuously perform operations on them to add value to the data, and then transform the data and the relationships into forms useful to people using market-leading query and visualization tools.
Today most of the data that is “collected” into big data back ends is collected by having someone or something query the data from its source.
Smartphones broke this paradigm, as it is impossible to query billions of smartphones and ask them for their data. In this new world it must become the responsibility of each end device to provide or push its data into the new real-time back end for these metrics, relationships and state.
All systems management software that relies upon querying things for data is now legacy software, unsuited for the modern world. The modern data collection paradigm needs to be based upon streams of data pushed from each device via a message bus like Kafka, to a back end that is capable of ingesting them all in real time and processing them all in real time.
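To make the push model concrete, here is a minimal sketch of devices pushing their own readings onto a bus while a back end consumes them as they arrive. An in-memory queue stands in for a message bus like Kafka so that the example runs on its own; the names `device`, `ingest`, the `cpu_pct` metric and the message fields are illustrative assumptions, not part of any real product or API.

```python
import queue
import threading
import time

# Assumption: an in-memory queue stands in for a message bus such as Kafka.
bus = queue.Queue()

def device(device_id, samples):
    """Hypothetical device agent: pushes its readings instead of waiting to be queried."""
    for value in samples:
        bus.put({
            "device": device_id,
            "metric": "cpu_pct",   # illustrative metric name
            "value": value,
            "ts": time.time(),
        })

def ingest(expected):
    """Hypothetical back end: consumes messages as they arrive on the bus."""
    received = []
    while len(received) < expected:
        received.append(bus.get())  # blocks until a device pushes something
    return received

# Three devices push two readings each; nothing ever polls them.
threads = [
    threading.Thread(target=device, args=(f"dev-{i}", [10 * i, 10 * i + 5]))
    for i in range(3)
]
for t in threads:
    t.start()
events = ingest(expected=6)
for t in threads:
    t.join()

print(len(events), "events ingested")
```

The point of the sketch is the direction of flow: the back end never asks a device for anything, so adding more devices adds producers rather than adding to a polling schedule.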
One of the main reasons for comprehensive, real-time data collection is that legacy systems management products collect data, at best, every 5 minutes, and many collect data as infrequently as hourly. This leaves too much time between when something bad happens and when the system knows about it.
Modern data collection needs to be real-time and continuous. Legacy management systems cannot deal with such volumes of data, so they pursue “sparse” approaches to data collection that sample and that fail to collect data comprehensively across all of the aspects of the software and hardware systems that support an end user, a device or an application.
Modern data collection needs to comprehensively collect data from every layer of the hardware and software ecosystem that supports an interaction or a transaction.
Still, the diversity in the sources of new management data is too great for any single vendor to stand a chance of collecting them all, or even of collecting all of the metrics, relationships and states that pertain to one set of interactions or transactions. The pace of innovation in this industry is simply too fast for any one vendor to be able to keep up. Therefore, only an approach that recognizes that there will be many sources of data and many vendors who specialize in collecting various types of data will succeed.
The issue then becomes that it is impossible to know, or to plan for, ahead of time how future streams of data will be related to existing streams of data. Therefore, these relationships must be established at the time that each new stream of metrics and state is added to the system. All previous attempts to pre-define a model of an environment, like the Common Information Model (CIM) of computing, are now invalid, since anything defined by a committee cannot keep up with the pace of innovation in these new environments.
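One way to picture relationships being established when a stream is added, rather than being defined up front, is a registry that links each new stream to the existing streams it shares identifying fields with. The sketch below is only an illustration under that assumption; `StreamRegistry`, the stream names and the key fields are hypothetical, and matching on shared keys is a simplification rather than a description of how any particular product works.

```python
from collections import defaultdict

class StreamRegistry:
    """Hypothetical registry: relationships are discovered when a stream is added."""

    def __init__(self):
        self.streams = {}                      # stream name -> set of key fields it carries
        self.relationships = defaultdict(set)  # stream name -> names of related streams

    def register(self, name, keys):
        keys = set(keys)
        # Assumption: two streams are related if they share at least one key field.
        for other, other_keys in self.streams.items():
            if keys & other_keys:
                self.relationships[name].add(other)
                self.relationships[other].add(name)
        self.streams[name] = keys

registry = StreamRegistry()
registry.register("vm_cpu", ["host", "vm"])
registry.register("app_latency", ["vm", "service"])
registry.register("storage_iops", ["host", "lun"])

for name in registry.streams:
    print(name, "->", sorted(registry.relationships[name]))
```

No model of the environment exists before the first stream arrives; the third stream is related to the first through the `host` field only because both happened to carry it when they were registered.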
Data needs to arrive in real time (no more Extract) and be stored in a useful form on a continuous basis (no more Load). Streaming ingest, coupled with continuous, real-time transformation and streaming writes of the resulting useful data, needs to replace ETL.
Then all of this data, and the relationships between these streams of data, needs to be stored in a real-time data store that can keep up with the ingest rates from these new environments and, crucially, make this data instantly and continuously available to modern analytics and BI tools like Tableau and Qlik and to modern real-time visualization systems like Grafana.
Real-time streaming combined with continuous transformation needs to replace existing batch processes. Single-vendor approaches need to be replaced by an ecosystem of vendors that can collectively keep up with the pace of innovation, backed by a high-performance big data back end.
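The contrast with batch ETL can be sketched with a chain of generators, in which each record is transformed and written as soon as it arrives instead of waiting for a scheduled extract and load. Everything here is an illustrative assumption: `source`, `transform`, `write`, the `latency_ms` field and the 200 ms threshold are made up for the example, and a plain list stands in for a real-time data store.

```python
def source():
    """Hypothetical stream of raw readings, yielded one at a time as they 'arrive'."""
    for seq, latency_ms in enumerate([120, 80, 400, 95]):
        yield {"seq": seq, "latency_ms": latency_ms}

def transform(events, threshold_ms=200):
    """Continuous transformation: enrich each event as it passes through."""
    for event in events:
        yield {**event, "slow": event["latency_ms"] > threshold_ms}

def write(events, store):
    """Streaming write: each enriched event is stored immediately."""
    for event in events:
        store.append(event)

store = []  # assumption: a list stands in for a real-time data store
write(transform(source()), store)

for row in store:
    print(row)
```

Because the stages are lazy, no stage ever holds the full data set; the first record is already in the store before the second one has been read.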
(About the author: Bernd Harzog is the founder and chief executive officer at OpsDataStore, a real-time big data back end for all IT operations management data and vendors.)