IT Ops is now entering its third generation, backed by cloud computing and automation, where setting up a new environment is a matter of a few clicks, compared to installing a bare-metal server in the second generation.
Instances can be spun up by developers on a per-application basis or launched automatically in response to increased user load. The complexity, as well as the number of active servers to manage, has increased significantly, resulting in a much larger amount of collected data to sort through and track.
According to a 2015 Application Performance Monitoring survey, 65 percent of surveyed companies own more than 10 different monitoring tools. Despite the increase in instrumentation capabilities and the amount of collected data, enterprises rarely use these significantly larger data sets to improve the effectiveness of availability and performance processes through root cause analysis and incident prediction.
To make sense of the giant piles of data, IT Ops have turned to machine learning. This field studies how to design algorithms that can learn by observing data, discovering new insights in data, developing systems that can automatically adapt and customize themselves, and designing systems where it is too complicated and costly to implement all possible circumstances (such as search engines and self-driving cars).
There has been a significant increase in machine learning applications in IT Ops due, in large part, to the ongoing growth of machine learning theory, algorithms and computational resources on demand. Many organizations are finding that machine learning allows them to better analyze large amounts of data, gain valuable insights, reduce incident investigation time, determine which alerts are correlated, and what causes event storms – and even prevent incidents from happening in the first place.
For example, VSE Corporation, one of the largest US government contractors, implemented a machine learning solution to crunch their vast amounts of data. Using this approach, VSE was able to deliver insights that dramatically cut incident investigation time, facilitated validation of environment changes and helped VSE stay in compliance effectively and efficiently.
To address today's key IT Ops challenges, here are six machine learning trends being leveraged in place of earlier correlation approaches:
Trend 1: Actionable insights in natural language
Reviewing, processing, and interpreting ever-increasing amounts of data has become an integral part of daily business in IT Operations. In a typical scenario, an IT Ops user looks at a dashboard containing various dimensions of data and tries to analyze it through pie charts or trend lines, commonly set up through manual configuration.
To reach some level of automation, users typically have to have a general understanding of the data in front of them, select portions (i.e., data sets within overall data), and select suitable analysis tools (e.g., trending formulas, chart parameters, etc.). For small amounts of data, this may not be a daunting task, but small amounts of data also provide a less accurate snapshot of the overall story.
When more accurate results are desired or available data amounts are extremely large, common manual configuration-based tools are likely to be inadequate at best or unusable at worst.
Last year saw significant progress in the ITOA space in blending and correlating multiple data sources. However, most ITOA solutions still require customers to slice and dice the outcomes of blended analysis in order to interpret them, or they present these outcomes in a complex, specialized manner.
This year, ITOA technologies will leverage recent advances in machine learning to automate data interpretation. The result will be a generation of specific, easy-to-understand insights that IT Operations teams can use without significant training and investigation overhead. The machinery of the analytics will be hidden from users, who will consume and act on automatically generated findings, guidelines, and instructions presented in natural language.
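As a toy illustration of this kind of automated interpretation, the sketch below turns a raw metric series into a plain-English finding. The metric name, baseline logic, and threshold are all hypothetical assumptions, not any vendor's actual method:

```python
# Minimal sketch: convert a metric analysis into a natural-language insight.
# The baseline is the mean/stdev of all but the latest reading (an assumption
# for illustration); real systems use far richer models.
from statistics import mean, stdev

def describe_metric(name, values, threshold_sigmas=2.0):
    """Return a plain-English finding for a metric series."""
    baseline, spread = mean(values[:-1]), stdev(values[:-1])
    latest = values[-1]
    if spread and abs(latest - baseline) > threshold_sigmas * spread:
        direction = "above" if latest > baseline else "below"
        return (f"{name} is {abs(latest - baseline) / spread:.1f} standard "
                f"deviations {direction} its recent baseline of {baseline:.1f}; "
                f"investigation is recommended.")
    return f"{name} is within its normal range."

# Hypothetical CPU readings: a stable baseline followed by a spike.
print(describe_metric("cpu_utilization_pct", [41, 39, 43, 40, 42, 78]))
```

The point is the output format, not the statistics: the user reads a sentence, not a chart.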
Trend 2: Smart chatbots
Chatbots are an exciting new trend in the technology ecosystem and are starting to become relevant in the enterprise. Chatbots have been called the “new apps,” and enterprises are adopting chatbot platforms to enhance their operations, elevate IT interactions, and help users find information and complete tasks.
Today’s chatbots are fairly basic, but they will become more technologically advanced in the near future. They will continue to provide automated conversations that let users do everything from checking the weather to managing personal finances to shopping online. And, of course, the business applications are limitless.
Chatbots function as cognitive systems, using a combination of natural language understanding, machine learning, and artificial intelligence. They understand language beyond specific commands, remember the context of a conversation, and get smarter as they learn from the conversations they have.
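A real chatbot's NLU stack is far more sophisticated, but the two behaviors named above can be sketched with keyword-based intent matching plus a memory of the last topic. All intents and phrases here are illustrative assumptions:

```python
# Minimal sketch of two chatbot behaviors: intent matching and remembering
# conversation context. Keyword lookup stands in for real NLU.
class MiniChatbot:
    INTENTS = {
        "disk": "Current disk usage is being checked.",
        "restart": "A restart request has been queued.",
    }

    def __init__(self):
        self.context = None  # remembers the last matched intent

    def reply(self, message):
        words = message.lower().split()
        for keyword, response in self.INTENTS.items():
            if keyword in words:
                self.context = keyword  # update conversation context
                return response
        if self.context:  # no intent matched: fall back on remembered context
            return f"Still on the '{self.context}' topic -- can you clarify?"
        return "Sorry, I did not understand that."

bot = MiniChatbot()
print(bot.reply("How is disk usage on web01?"))
print(bot.reply("And what about now?"))  # answered using remembered context
```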
Trend 3: Anomaly detection using behavior signatures
One of the fundamental problems in monitoring key performance indicators is to decide when to act on readings. In general, we’re interested in two aspects:
- Monitoring potentially bad situations, which are usually specified as policies identifying known problems (e.g., threshold for low remaining disk space alert).
- Monitoring good situations and detecting when they stop happening. It is important to identify unknown problems, such as deviation from a steady, stable state, a drop in desired system behavior, or a sudden decrease in performance.
Approaches to detecting such situations typically rely on dynamic thresholds based on standard deviation calculations. These aim to catch deviations, but in practice such models are too simplistic to deal with convoluted signals, causing too many false alerts.
The machine learning approach to this problem is to identify normal system behavior and report any anomalies that deviate from it. This can be achieved by constructing behavior signatures for a particular time period and applying an anomaly detection algorithm on top of them.
Such algorithms first observe how the system normally behaves and then they start reporting significant deviations from it. Moreover, the algorithm is able to continuously adapt its behavior signature library, thus learning how behavior changes over time.
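The idea can be sketched as follows, assuming (for illustration only) that a signature is simply a per-hour-of-day mean and standard deviation learned from normal operation; production signatures are much richer and adapt continuously:

```python
# Minimal sketch of behavior-signature anomaly detection: learn a per-hour
# baseline from history, then flag readings that deviate strongly from the
# signature for that hour.
from statistics import mean, stdev
from collections import defaultdict

def build_signature(history):
    """history: list of (hour_of_day, value) pairs from normal operation."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) > 1}

def is_anomaly(signature, hour, value, k=3.0):
    base, spread = signature[hour]
    return spread > 0 and abs(value - base) > k * spread

# Hypothetical request-latency readings (ms): busy mornings, quiet nights.
history = [(9, v) for v in (100, 104, 98, 102, 101)] + \
          [(3, v) for v in (20, 22, 19, 21, 20)]
sig = build_signature(history)
print(is_anomaly(sig, 9, 180))  # far above the 9 a.m. baseline
print(is_anomaly(sig, 9, 103))  # within normal behavior
```

Note that a static threshold could not express this: 103 ms is normal at 9 a.m. but would be wildly anomalous at 3 a.m. under the same signature.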
Trend 4: Alert clustering - Building situational awareness
As the size of an IT environment increases, so does the number of alerts. For example, a large international bank deployed a set of monitoring tools across 40,000 servers, producing 600,000 events per hour. These, in turn, generated 47,000 help desk tickets annually with 2,000+ Level 2 escalations – more than five escalations every day. However, in most cases, alerts are correlated with each other.
A change in an operating system driver might cause a database service to hang, which then triggers a storm of alerts from the various applications relying on that database. Looking at each alert individually leads to long response times, failed transactions, service unavailability, and so on. No single alert gives a clear answer about what’s happening, and investigations take significant time, effort, and expertise to identify the root cause. Can we automatically examine tens of thousands of alerts to arrive at the same conclusion?
This is where machine learning steps in, with clustering in particular. Clustering is an unsupervised machine learning technique that logically groups similar items together. “Unsupervised” machine learning indicates that there is no guided learning involved – the algorithm automatically identifies meaningful relationships.
There are two fundamental approaches to clustering: bottom up and top down. In bottom up, the algorithm starts by treating each alert as its own cluster, and then iteratively merges clusters that are similar until the remaining clusters are too different from each other. Similarity could be defined as distance in time, host, service, etc. The top down approach starts with a pre-selected set of clusters, then iterates over the alerts adding each one to the nearest cluster.
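The bottom-up variant can be sketched in a few lines, assuming (purely for illustration) a distance function that combines time gap and host identity:

```python
# Minimal sketch of bottom-up (agglomerative) alert clustering: every alert
# starts as its own cluster; the closest pair of clusters is merged repeatedly
# until the remaining clusters are farther apart than a cutoff.
def alert_distance(a, b):
    time_gap = abs(a["time"] - b["time"]) / 60.0   # minutes apart
    same_host = 0.0 if a["host"] == b["host"] else 5.0  # penalty: different host
    return time_gap + same_host

def cluster_alerts(alerts, cutoff=3.0):
    clusters = [[a] for a in alerts]  # bottom up: one cluster per alert
    while len(clusters) > 1:
        # find the two clusters with the smallest inter-alert distance
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: min(
            alert_distance(x, y)
            for x in clusters[ij[0]] for y in clusters[ij[1]]))
        d = min(alert_distance(x, y) for x in clusters[i] for y in clusters[j])
        if d > cutoff:   # remaining clusters are too different: stop merging
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

alerts = [
    {"host": "db01", "time": 0},      # hypothetical alert storm on db01...
    {"host": "db01", "time": 30},
    {"host": "web07", "time": 4000},  # ...and one unrelated alert much later
]
print(len(cluster_alerts(alerts)))  # → 2: one db01 group, one separate alert
```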
Clustering thus establishes high-level situation awareness by removing redundant, low-quality alerts and by clustering alerts into meaningful groups.
Trend 5: Root cause analysis using causal reasoning
Root cause analysis is one of the top problems IT Operations teams still struggle with today. Gartner reports that “root causes of performance problems have taken an average of 7 days to diagnose in 2016, compared to 8 days in 2005 and only 3 percent of incidents were predicted, compared to 2 percent in 2005.”
While multiple monitoring tools are useful for raising critical alerts on IT infrastructure, they don’t indicate the root cause of a problem – the more valuable task.
IT tools for monitoring and governing the application lifecycle management process typically don’t talk to each other. For instance, to deploy a new service, a new change request is opened and executed via an automated deployment script. Once the application is up and running, performance and availability are monitored with logs, network activities, and key APM metrics. There is no one drawing a red line connecting the events together into a holistic overview of the operations. Automatic root cause analysis depends on establishing relationships between data sources. Correlating events, tickets, alerts, and changes can identify cause-effect relationships.
To achieve this, machine learning can be applied in two stages: the first links the data from different IT tools, and the second determines where it makes the most sense to correlate.
In the first stage of dealing with unstructured data, the linking process is not obvious. Machine learning can infer relationships among different data sources and determine how to link them to environments. Algorithms include fuzzy matching rules and association rules identifying events that frequently occur at the same time, linguistic analysis of data in natural language, and prediction models estimating system change effects. This process yields a set of data samples semantically annotated across silos.
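One small piece of that linking stage can be sketched with a fuzzy matching rule. Here, records from two hypothetical silos (change requests and alerts) refer to the same hosts under slightly different names, and `difflib`'s similarity ratio stands in for a learned matcher:

```python
# Minimal sketch of cross-silo linking via fuzzy matching: pair each alert
# with the change request whose host name matches best, if it is close enough.
from difflib import SequenceMatcher

def fuzzy_link(changes, alerts, threshold=0.8):
    links = []
    for alert in alerts:
        best = max(changes, key=lambda c: SequenceMatcher(
            None, c["host"], alert["host"]).ratio())
        score = SequenceMatcher(None, best["host"], alert["host"]).ratio()
        if score >= threshold:  # only keep confident matches
            links.append((best["id"], alert["id"], round(score, 2)))
    return links

# Hypothetical records: the CMDB uses short host names, monitoring uses FQDNs.
changes = [{"id": "CHG-101", "host": "db-prod-01"},
           {"id": "CHG-102", "host": "web-prod-07"}]
alerts = [{"id": "ALR-9", "host": "db-prod-01.corp"}]
print(fuzzy_link(changes, alerts))
```

The output is exactly the kind of semantically annotated sample the text describes: an alert tied to the change that may explain it.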
The second stage establishes an environment dependency model based on environment topology, component dependencies, and configuration dependencies.
Such an environment dependency model can be used for topology-based correlation, ruling out candidate root causes in elements that are unreachable from the environment where the problem was reported. Alternatively, the dependency diagram can be modeled as a probabilistic Bayesian network, which augments the model with probabilities of error propagation, defect spillover, and influence.
Building such a model by hand is practically infeasible, as it requires specifying many probabilities of influence between environment components, even before addressing the constantly evolving environment structure. However, by utilizing machine learning and vast amounts of historical performance data, it is possible to build a model that estimates all of the required probabilities automatically and updates them on the fly.
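The simplest version of that estimation is counting: P(child fails | parent failed) approximated from historical incidents, updated with every new observation. The component names and dependency edges below are hypothetical, and a real Bayesian network would do proper inference rather than raw conditional frequencies:

```python
# Minimal sketch of learning error-propagation probabilities from historical
# incident data: co-occurrence counts over dependency edges, updated on the fly.
from collections import Counter

class PropagationModel:
    def __init__(self):
        self.parent_failures = Counter()  # times the parent failed
        self.co_failures = Counter()      # times the child failed with it

    def observe(self, failed, dependency_edges):
        """failed: set of component names; dependency_edges: (parent, child)."""
        for parent, child in dependency_edges:
            if parent in failed:
                self.parent_failures[(parent, child)] += 1
                if child in failed:
                    self.co_failures[(parent, child)] += 1

    def p_propagate(self, parent, child):
        n = self.parent_failures[(parent, child)]
        return self.co_failures[(parent, child)] / n if n else 0.0

edges = [("os_driver", "database"), ("database", "app")]
model = PropagationModel()
model.observe({"os_driver", "database", "app"}, edges)  # full cascade
model.observe({"os_driver"}, edges)                     # driver failed alone
print(model.p_propagate("os_driver", "database"))  # → 0.5
```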
Trend 6: IT process mining
Detailed instrumentation and the vast amount of collected data enables organizations not only to track defined processes but also to identify informal, non-documented processes and activities taking place in the IT infrastructure. The core machine learning technology behind this task is process mining.
Process mining automatically crawls existing change request records, incident records, deployment information, event logs, and actual changes in the IT system to identify processes underlying IT operations in an organization. The algorithm behind the scenes analyzes dependencies between events and identifies steps frequently appearing together.
The same type of algorithm powers retail basket analysis in e-commerce, identifying what items are frequently bought together, what the next product is that customers are likely to purchase, and how to bundle products/services together to maximize revenue. Such analysis can lead not only to better understanding of the processes, but also to their improvement and automation.
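The shared core of process mining and basket analysis is frequent-pattern counting, which can be sketched as below. The step names are hypothetical, and real process-mining tools go much further, reconstructing ordering and branching:

```python
# Minimal sketch of the frequent-pattern idea behind process mining and
# retail basket analysis: count which pairs of steps (or items) appear
# together across recorded sequences, keeping pairs above a support threshold.
from collections import Counter
from itertools import combinations

def frequent_pairs(sequences, min_support=2):
    counts = Counter()
    for seq in sequences:
        for pair in combinations(sorted(set(seq)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical step sequences reconstructed from change-request records.
sequences = [
    ["open_ticket", "deploy", "verify"],
    ["open_ticket", "deploy", "rollback"],
    ["open_ticket", "deploy", "verify"],
]
print(frequent_pairs(sequences))
```

A pair such as `("deploy", "open_ticket")` appearing in every sequence is the signature of an informal, undocumented process step worth formalizing.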
IT operations have grown large enough to automate every function that can be automated and to use detailed component instrumentation to ensure everything is running as it should. IT operations analytics has now entered a new age – an era of algorithmic IT operations that turns learning algorithms loose on the vast amounts of collected data, alerts, tickets, and measurements to extract the hidden insights that provide accurate alerting, establish situational awareness, find root causes, and predict incidents before they happen.