IT infrastructure is growing ever more complex. Rapid application deployment, software-defined networking (SDN), virtual computing environments, platforms as a service (PaaS), and public, private, and hybrid clouds have added new dimensions of variability to the already diverse set of transaction types present in today’s IT infrastructures.

IT environments can change daily and sometimes even hourly. Long gone are the days when a static network diagram was a sufficient description of the IT environment. At the same time, IT data volumes are growing exponentially.

Devices and applications generate massive sets of log data, tracking activity related to networks, computers, users, and various applications. Key performance indicators (KPIs) are identified and tracked. Error logs are captured, and general operational statistics are recorded.

Managing the data generated by IT infrastructure has itself become a big data challenge. Keeping these complex IT environments running smoothly is getting harder. When diagnosing application performance issues, it can be especially challenging to assess the impact of the various physical and virtual communication paths present in these environments.

Teams also worry that too many staff members are spending too much time troubleshooting. When incidents such as outages or slowdowns do occur, IT operations teams strive to detect and resolve them before end users notice and report them; however, this can be difficult.

So the question is: How can IT operations teams find these problems before users do?

One approach (not recommended) is to create a visualization of every KPI in your environment and hire an army of operators and analysts to monitor every one. This is neither efficient nor cost-effective.

Another possible approach is to employ traditional infrastructure monitoring that uses rules and thresholds to detect possible error conditions. This has proven to be very “noisy,” generating many false positive alerts, and it is generally unable to detect or help resolve specific application-level problems.

Here, we suggest that embracing automation and artificial intelligence may be the answer.

Now, before you stop reading: we're not talking about the science fiction kind of AI, but about a pragmatic approach to using technology to help with the management and monitoring of IT infrastructure. Leading-edge organizations are starting to deploy analytics to help them monitor and respond to IT incidents.

IT Operations Analytics (ITOA) is a general term used by industry analysts to describe the type of analytics applicable to managing the availability and performance of IT infrastructure. Sometimes described as behavioral analytics, this technology uses machine-learning-based algorithms to model behaviors in metric and text-based data, detecting patterns, trends, and anomalies that can feed an organization's alerting and root cause discovery capabilities.
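To make the idea of modeling a metric's behavior concrete, here is a minimal sketch in Python. It is not how any particular ITOA product works. It assumes a simple statistical stand-in for a learned model: a trailing-window baseline that flags values deviating strongly from recent behavior. The function name, the window size, the threshold, and the sample response-time data are all hypothetical, chosen only for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=20, threshold=3.0):
    """Return the indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations.

    Illustrative baseline only: a real ITOA tool would use far richer models.
    """
    history = deque(maxlen=window)  # most recent `window` observations
    anomalies = []
    for i, value in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip scoring when the baseline has no variation at all.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical response-time KPI in milliseconds, with one injected spike.
latencies = [100 + (i % 5) for i in range(40)]
latencies[30] = 180

print(detect_anomalies(latencies))
```

Unlike a fixed threshold, the baseline here is derived from the metric's own recent behavior, so no one has to hand-pick a limit for every KPI. The trade-off in this simplified version is that a spike stays in the window for a while and temporarily widens what counts as normal.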

By using machine-learning-based IT Operations Analytics tools, IT operations teams can identify issues faster and more effectively, and can even resolve most of them before end users are affected.

(About the author: Mark Jaffe is chief executive officer at Prelert. He is a serial software entrepreneur who has been instrumental in the success of numerous software companies. Over his 23 years of high tech experience, Mark has held roles in product marketing, software sales and executive management.)