Data generated from IT infrastructure and applications can provide valuable insights into business operations, trends and customer behavior but can prove difficult for many organizations to aggregate, sort and understand. This is primarily because much of machine-generated IT data is unstructured. It is important to consider the challenges of collecting and understanding unstructured IT data as well as identify best practices for how companies can make IT data actionable and answer key business questions such as how users are consuming services, which products are becoming popular on e-commerce sites or how traffic is impacting a website.
What is IT Data?
IT data is everywhere, and it comes in many forms. It can show up in traditional forms, such as log files from sources like Web server logs and UNIX/Linux system logs. It can also come from network devices like switches, routers and firewalls or mainframe data like the RACF subsystem on IBM mainframes.
But it can also come from a wide variety of sources that may not typically come to mind as directly associated with log management. This includes data from anything with a digital heartbeat, such as smart meters in a smart grid, as well as health care data governed by the HIPAA regulations. These types of data are important to be aware of because there are regulations around storing and retaining them. Transactional information from credit card transactions is also included, and its handling is mandated by the Payment Card Industry Data Security Standard.
Collecting IT Data
Because IT data is everywhere, you need a wide array of tools and tricks to gather IT data in a single place so it can be stored, indexed and retrieved. There are two essential methods of doing this. With the “push” method of collection, the systems, devices and applications deliver data automatically to the central repository. The “pull” method is where the central repository has to reach out and retrieve the data from various places across an IT installation.
The push method of collecting IT data is straightforward. A common protocol, syslog, is generally used by UNIX and Linux servers as well as by many network devices, like switches, routers and firewalls. But there are cases where files on disk contain interesting diagnostic information but have no means of reaching the central repository. In this case, a software agent can be used to send the file to the central repository and transmit any subsequent changes. Where the connection is unreliable, a smart agent can buffer information while the network connection is down and send it once the connection recovers.
This could happen to a large retailer who has thousands of stores in far-flung places. Some of the remote sites are likely to be connected to the Internet by unique means like satellite or even dial-up. In these cases, bandwidth management is an important concern. Another important concern with agents is to ensure that any data transmitted is securely encrypted.
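The buffering behavior of a smart agent can be sketched roughly as follows. This is a minimal illustration, not a description of any particular product: the `BufferingAgent` class, its `send` callable and the `max_buffered` limit are all hypothetical names, and the transport is assumed to raise `ConnectionError` while the link is down. Encryption and bandwidth throttling are left out.

```python
from collections import deque
from typing import Callable, Deque


class BufferingAgent:
    """Hypothetical store-and-forward agent: hold lines while the link is down."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000):
        # Assumption: `send` delivers one line and raises ConnectionError when offline.
        self._send = send
        # Bounded queue; when it is full, the oldest line is dropped first.
        self._buffer: Deque[str] = deque(maxlen=max_buffered)

    def submit(self, line: str) -> None:
        """Queue a new line, then try to drain the queue in order."""
        self._buffer.append(line)
        self.flush()

    def flush(self) -> int:
        """Send buffered lines oldest-first and stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # link is down; keep the remaining lines for later
            self._buffer.popleft()  # remove only after a successful send
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of lines still waiting to be delivered."""
        return len(self._buffer)
```

Capping the buffer is a design choice for sites with limited disk or memory: a remote store that stays offline for days loses its oldest lines rather than exhausting local resources.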
The pull method of collecting IT data occurs when the central repository has to reach out on a periodic basis and obtain data from other systems, devices and applications. This may happen when the device cannot send data on its own or when the process of sending the data would cause operational issues. This kind of collection is done via standard protocols such as FTP and HTTP, their secure counterparts (FTPS and HTTPS), and SSH/SCP. It is also usually set up to occur at regular intervals (e.g., once an hour).
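A pull collector can be sketched in a few lines. The example below is only an illustration under stated assumptions: `PullCollector` and its `fetch` callable are hypothetical, and `fetch` stands in for whatever FTP, HTTP or SCP transfer actually returns the lines a source currently holds. The collector remembers how many lines it has already seen per source so that each poll returns only new data.

```python
from typing import Callable, Dict, List


class PullCollector:
    """Hypothetical poller that returns only lines not seen on earlier polls."""

    def __init__(self, fetch: Callable[[str], List[str]]):
        # Assumption: `fetch(source)` returns every line the source currently holds.
        self._fetch = fetch
        self._seen: Dict[str, int] = {}  # source name -> lines already collected

    def poll(self, source: str) -> List[str]:
        lines = self._fetch(source)
        start = self._seen.get(source, 0)
        if len(lines) < start:
            # The file shrank, so it was probably rotated or truncated: start over.
            start = 0
        self._seen[source] = len(lines)
        return lines[start:]
```

In practice a scheduler such as cron would call `poll` for each source on the chosen interval.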
There are a few basic rules to follow when collecting IT data:
- Know your data. It is important that you understand your data and its behavior. An IT data architect should understand not only the average size of the IT data messages sent, but also the standard deviation. The standard deviation is important because understanding the variability of the unstructured data collected will give you crucial insight into how your IT data management system will perform. Knowing the timing of your data matters as well. For example, knowing that your systems tend to log “heartbeat” messages regularly can help you to know if your systems are still operating. A delayed or missing heartbeat message means you may have trouble.
- Collect everything, but not all at once. This is also known as “know your workload.” Collect IT data from as many sources as possible. This is important for two reasons. First, it is generally impossible to know ahead of time where your security breach or unexpected operations issue will occur, so have as much data to review as possible because you can’t see what you can’t record. Second, collecting broadly helps you understand the relative importance of the data streams, which will give you an indication of how your system will perform during peak usage and data floods.
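The "know your data" rule can be made concrete with a small sketch. The function names, the 60-second interval and the tolerance below are illustrative assumptions, not values from any specific system. The first function reports the mean and standard deviation of message sizes; the second flags gaps between heartbeat messages that are longer than expected.

```python
import statistics
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def message_size_profile(messages: Sequence[str]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of message sizes in bytes.

    Requires at least one message; statistics.mean raises on empty input.
    """
    sizes = [len(m.encode("utf-8")) for m in messages]
    return statistics.mean(sizes), statistics.pstdev(sizes)


def missed_heartbeats(
    timestamps: Sequence[datetime],
    expected_interval: timedelta,
    tolerance: timedelta,
) -> List[Tuple[datetime, datetime]]:
    """Return (previous, next) heartbeat pairs whose gap exceeds interval + tolerance."""
    ordered = sorted(timestamps)
    limit = expected_interval + tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]
```

A large standard deviation relative to the mean suggests bursty or irregular messages, which is worth knowing before sizing storage and indexing capacity.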
Storing IT Data
The main challenge of storing unstructured data is that in order for the data to be useful, it must be stored as economically as possible. In addition, the data must be queried quickly in order for users to get value out of the data store. These two requirements are often at odds.
Storing data efficiently can be approached in a number of ways. Modern relational database systems are good at storing data for use by a SQL query engine but are very inefficient for scanning the data. Compressing the data on disk is extremely space-efficient, but this makes querying the data more difficult since compressed data usually has no schema information behind it.
Understanding the kinds of queries a set of users is likely to pose is important. There are two general query types that users are interested in: string search and full query. What is unique about searching across IT data is that time is almost always part of the question. Each query will have a time selector associated with it (e.g., “Find ‘reboot’ in the system log in the last five minutes”). String search is accomplished by searching for strings of characters in the IT data. This can be done very rapidly with traditional computer science string search techniques. Full query can be done as well, but indexes must be built alongside the data, and this increases the storage needed.
As a result of these challenges, modern IT data management systems support both string and full query search methods.
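The difference between the two query styles can be illustrated with a short sketch. Everything here is an assumption made for illustration: records are modeled as `(timestamp, text)` pairs, `search_recent` is a plain substring scan restricted by a time window, and `build_token_index` is a very simple stand-in for the indexes a full query engine would maintain.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Iterator, Sequence, Set, Tuple

LogRecord = Tuple[datetime, str]  # (timestamp, raw message text)


def search_recent(
    records: Iterable[LogRecord], needle: str, now: datetime, window: timedelta
) -> Iterator[LogRecord]:
    """String search with a time selector: scan records inside [now - window, now]."""
    start = now - window
    for ts, text in records:
        if start <= ts <= now and needle in text:
            yield ts, text


def build_token_index(records: Sequence[LogRecord]) -> Dict[str, Set[int]]:
    """Map each whitespace-separated token to the positions of records containing it.

    The index speeds up lookups but is extra data stored alongside the records.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, (_, text) in enumerate(records):
        for token in text.lower().split():
            index[token].add(position)
    return index
```

The scan needs no extra storage but touches every record in the window, while the index answers token lookups directly at the cost of the additional space noted above.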
Making IT Data Actionable
It is useless to collect and store unstructured data unless you can get intelligence and actionable insight from the data. It is important to have a reporting system attached to an IT data management system that can provide structured, human-readable reports on the aspects of IT data management systems that are meaningful to the business. But reports by themselves are insufficient. There must be a system in place to circulate the report to ensure that its contents are recognized and that in the event of a problem some form of remediation takes place as well. This workflow component is a critical part of any modern data management system.
Joy’s Law, attributed to Sun Microsystems Co-Founder Bill Joy, claims that most of the smart people do not work for you or, stated another way, “innovation happens elsewhere.” For our purposes, this implies that no matter how well-constructed the reporting interface is or how well thought-out the workflow process is, there will always be a customer need that was not considered when the system was designed. For this reason, all modern IT data management systems need to have an application programming interface (such as XML-based Web services) to add and query data. This allows custom-built applications to extend and augment the IT data management system, using modern programming languages like Java, PHP or Visual C# from Microsoft.
The key is that once properly collected, IT data can deliver powerful insight. For example, if you see a login attempt in Seattle and a key card is used in Stockholm for the same person, a data center manager should be alerted that there is a security problem. Moreover, if you are tracking logs from a Web content management system, you could use IT data to indicate what content is most popular with readers. It could also be used to track Web crawlers, like Google, and check on their frequency.
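The Seattle and Stockholm example amounts to correlating events from two different sources. A minimal sketch of that correlation is shown below; the `Event` fields, the `location_conflicts` function and the one-hour window used in the usage note are assumptions chosen for illustration, and a real system would also need to normalize user identities across sources.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    user: str
    kind: str  # e.g. "login" or "badge" (hypothetical labels)
    city: str
    when: datetime


def location_conflicts(
    events: Iterable[Event], window: timedelta
) -> List[Tuple[Event, Event]]:
    """Return event pairs for the same user in different cities within `window`."""
    by_user: Dict[str, List[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.when):
        by_user[event.user].append(event)

    conflicts: List[Tuple[Event, Event]] = []
    for user_events in by_user.values():
        for i, first in enumerate(user_events):
            for second in user_events[i + 1:]:
                if second.when - first.when > window:
                    break  # events are time-ordered, so later ones are further away
                if second.city != first.city:
                    conflicts.append((first, second))
    return conflicts
```

Called with a window of `timedelta(hours=1)`, a login in Seattle followed twenty minutes later by a key card swipe in Stockholm for the same user would be returned as a conflict that can feed an alert.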
Data generated from IT infrastructure and applications can provide valuable insights into business operations, trends and customer behavior, provided that it is collected, reported on, and analyzed using a few best practices. Using IT data in this way can improve visibility into IT operations and can also help to keep IT operations more secure and more in line with regulatory requirements.