Most data gets old or, rather, loses its production value very quickly -- often instantly. It then begins to pile up on file servers, degrading performance and posing unnecessary risk. More and more organizations are realizing that unconditionally expanding their storage footprint is simply not sustainable or productive.

The question then becomes how to reduce it and retain that which adds value to existing and future business. After all, some data may be subject to regulation or compliance as well as internal policy mandates.

Metadata versus Content

Electronically stored information is binary in nature and, much like that composition, its worth also has two dimensions. The first is the metadata surrounding a data object or file. The second is the actual content of the file.

Metadata is the information about the data object itself: its descriptors. A useful comparison is a container; metadata is the container for the actual data. It resides on the outside, or periphery, detailing particulars such as who created the file in question and when, when the file was last accessed and modified, and so on.

The content of data is exactly that: It is the substance, what is on the inside. If you think of data as a storybook, the metadata would be when it was published and by whom; the content would be the actual story.
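To make the distinction concrete, here is a minimal Python sketch that reads a file's metadata separately from its content. The file name is purely hypothetical.

```python
import os
from datetime import datetime

path = "quarterly_report.docx"  # hypothetical file

# Metadata: the "container" -- descriptors about the file, not its substance
stat = os.stat(path)
print("Size (bytes): ", stat.st_size)
print("Last modified:", datetime.fromtimestamp(stat.st_mtime))
print("Last accessed:", datetime.fromtimestamp(stat.st_atime))

# Content: the "story" itself -- what is actually inside the file
with open(path, "rb") as f:
    content = f.read()
print("First bytes of content:", content[:32])
```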

Structured versus Unstructured

Data is classified as one of two types: structured or unstructured. Without getting into the nuances behind these two definitions, in layman’s terms, structured data is data that can be readily and easily identified. The most common form of structured data is data that already resides within a database, where it can simply be queried and managed from within.

Unstructured data is just the opposite. It is data that resides on just about everything else that is not a database (though it may rely upon one). Examples of unstructured data sources include desktops and laptops, file servers, email, content management systems (believe it or not), archives and cloud storage. The takeaway here is that most of what an organization needs to clean up and manage more effectively resides on data sources that cannot readily and easily identify the data they contain, i.e., unstructured data sources.

Step 1: Find a Common Denominator

Now equipped with an understanding of the anatomy of data, as well as where it lives or rather how it can live, how do we arrive at its worth? How do we now use what we know about it to determine its life expectancy?

Unfortunately, this answer may be different for every organization, based upon the nature of its business, whether it is private or public, as well as its internal policy. Bearing this in mind, let’s look to a best practice approach for where to start with data cleanup.

It is safe, or perhaps prudent, to assume that any and all data is subject to some sort of retention within any organization. This simply means that some form or copy of all business material must exist and be readily available until some predetermined and enforced point in time. Ideally, organizations have a data retention policy in place that defines these criteria and is enforced. Unfortunately, many organizations have not yet defined such a policy, or do not enforce it, and therefore continue to pay to store garbage data indefinitely, treating the cost as forgivable because the problem seems too complex to tackle.

Considering all company data is subject to retention, a safe approach is simply to determine the longest data retention mandate as defined within any regulation, compliance and/or internal policy to which an organization and/or department is subject. This maximum data retention limit may then be viewed as the common retention limit or the maximum amount of time any business material should be retained.  

For example, if a company is subject to Sarbanes-Oxley as well as an internal HR policy, and the two mandates specify different retention periods, the organization may simply identify which of the retentions is longest and retain all data for that period. This method satisfies both requirements, as there is no external mandate as of yet requiring corporations to delete data upon a specific expiration.
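The arithmetic is simple, but a small sketch makes the "common denominator" explicit. The retention figures below are assumptions for illustration only; the real numbers depend on the regulations and internal policies a given organization is subject to.

```python
# Hypothetical retention mandates, in years (illustrative values only)
retention_mandates = {
    "Sarbanes-Oxley (assumed)": 7,
    "HR internal policy (assumed)": 6,
}

# The common retention limit is simply the longest mandate: keeping everything
# for that period satisfies every shorter requirement as well.
common_retention_years = max(retention_mandates.values())
print(f"Common retention limit: {common_retention_years} years")
```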

Step 2:  Clear Out the Clutter

Once a common retention limit has been defined, a process (preferably repeatable) for identifying or filtering data against that criterion must be determined. In our example, the filter criterion is an age of six years or more. Keep in mind that this specific criterion is found within the metadata, the container of the data object, because it is a descriptor; it is not found within the content.

Filtering unstructured data for criteria such as this may sound easy but can pose a significant challenge and risk. First, there is the issue of having to perform this task on all of an organization’s disparate data sources, which, when combined, may reach extremely large data sizes. Second, simply because a data object is older than six years does not necessarily mean it is out of production (i.e., no longer being utilized). Last, and most important, once a responsive set of data objects has been identified, how do you then act upon these files (i.e., delete them), and how do you defend that the action applied was appropriate should it come into question?
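To make the first challenge concrete, here is a minimal sketch of such a metadata filter. It walks a single, hypothetical directory tree and flags files whose modification time is older than six years; a production tool would need to span many disparate sources and, as noted above, confirm that old files are truly out of production before acting on them.

```python
import os
import time

ROOT = "/mnt/fileserver/share"                 # hypothetical data source
CUTOFF = time.time() - 6 * 365.25 * 24 * 3600  # roughly six years ago

candidates = []
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            stat = os.stat(path)
        except OSError:
            continue  # file vanished or is unreadable; skip it
        # Filter on metadata only: the file's last modification time
        if stat.st_mtime < CUTOFF:
            candidates.append(path)

print(f"{len(candidates)} files have not been modified in six years or more")
```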

Step 3: Intelligent Intelligence – Leveraging Indexing Technology 

In order to extract or discover the metadata and/or content from data objects, someone or something needs to catalog or index the data and detail the particulars, so to speak, such as how old the file in question is. The “someone” approach can be extremely limiting in functionality, as well as time-consuming and risky because of human error. “Something” approaches, namely software, are built to address the needs of big data and, specifically, data cleanup.

Software-based indexing engines allow you to connect to multiple, disparate data sources and compile a database of all relevant metadata and content, effectively structuring the unstructured. The use of a database allows the index to collect and contain all aspects of the data (all metadata, as well as the content), yet at a fraction of the total size due to its format. A searchable index of previously unstructured data enables the organization to address not only data cleanup but many other data-centric business needs, such as e-discovery, compliance, document management, data migration and information management as a whole.
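Conceptually, what such an engine does can be sketched in a few lines: walk a source, capture its metadata and store it in a queryable database. The sketch below uses SQLite and a made-up schema for illustration; commercial indexing engines add connectors for email, content management systems and the like, plus content extraction, but the principle of structuring the unstructured is the same.

```python
import os
import sqlite3

ROOT = "/mnt/fileserver/share"   # hypothetical unstructured data source
DB = "file_index.db"             # the index: a structured view of that source

conn = sqlite3.connect(DB)
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size_bytes INTEGER,
        modified REAL,
        accessed REAL
    )
""")

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue
        conn.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
            (path, st.st_size, st.st_mtime, st.st_atime),
        )
conn.commit()

# The formerly unstructured source can now be queried like any database,
# e.g. to support cleanup, e-discovery or migration planning.
count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print("Files indexed:", count)
```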

Step 4: Defending Deletion – Audit Reports Not Optional

Assuming the organization owns or has acquired such a solution, or perhaps has contracted a professional consulting firm with such means, the first filter condition (all files older than six years) may be applied. Additionally, a second filter condition should be utilized, selecting only files that have also not been accessed within those six years (or perhaps within the last two years). The point of the second condition is to provide another level of comfort that the files are indeed out of production, are no longer needed and may be removed.

Regardless of such assurances, deletions may still come into question and, as mentioned earlier, it is important to have the means to support any and all actions performed. The typical means are audit reports detailing the actions taken (such as a file deletion), as well as proof that the file was indeed responsive to the predefined filters and that its removal was pursuant to company policy.
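Putting the two filter conditions and the audit trail together, a simplified sketch might look like the following. It assumes the SQLite index from the previous sketch and records, for every deletion, when it happened and why the file was responsive; an actual defensible process would typically add approvals and tamper-evident logging.

```python
import csv
import os
import sqlite3
import time
from datetime import datetime

DB = "file_index.db"               # index built in the previous sketch
AUDIT_LOG = "deletion_audit.csv"   # the audit report backing each action
cutoff = time.time() - 6 * 365.25 * 24 * 3600

conn = sqlite3.connect(DB)
# Condition 1: not modified in six years. Condition 2: not accessed either.
rows = conn.execute(
    "SELECT path, modified, accessed FROM files "
    "WHERE modified < ? AND accessed < ?",
    (cutoff, cutoff),
).fetchall()

with open(AUDIT_LOG, "a", newline="") as log:
    writer = csv.writer(log)
    for path, modified, accessed in rows:
        try:
            os.remove(path)
            action = "deleted"
        except OSError as exc:
            action = f"delete failed: {exc}"
        # Record what was done, when, and why the file was responsive
        writer.writerow([
            datetime.now().isoformat(),
            path,
            action,
            f"last modified {datetime.fromtimestamp(modified):%Y-%m-%d}",
            f"last accessed {datetime.fromtimestamp(accessed):%Y-%m-%d}",
            "policy: common retention limit of six years exceeded",
        ])
```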

That said, most organizations find that files that have not been accessed in six years yet are requested at a later date often have absolutely nothing to do with work and should never have resided where they did. If a straight-to-deletion approach is still disconcerting, an organization need simply postpone the deletion and add a prior step that moves the responsive data to a secondary retention space for a period of time until deletion may safely occur. This approach can address most “what happened to my music or wedding pictures” inquiries.
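For that more cautious, staged approach, the deletion step in the sketch above could be replaced with a move to a quarantine area. The quarantine path and layout here are, again, assumptions for illustration.

```python
import os
import shutil

QUARANTINE = "/mnt/retention_quarantine"  # hypothetical secondary retention space

def quarantine(path: str) -> str:
    """Move a responsive file to the quarantine area instead of deleting it.

    The original directory structure is preserved underneath QUARANTINE so the
    file can be restored easily if someone asks for it before final deletion.
    """
    destination = os.path.join(QUARANTINE, path.lstrip(os.sep))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.move(path, destination)
    return destination
```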

The End Game: A Gain Through Loss

In the end, considerable storage space can be recovered, significantly reducing costs and risk exposure while also improving the quality of data and the speed at which it can be utilized. Additional advantages include more efficient workflows, such as data backup and disaster recovery, e-discovery, data migration, and application and file server upgrades, as well as easier adoption of newer technologies such as solid-state storage.

Efficiently managing enterprise data has quickly become the number one focus for organizations that realize the overarching benefits of reduced and concise business information. As an example, according to the CGOC, typically 1 percent of corporate information is on litigation hold, 5 percent is in a records category and 25 percent has current business value. That means that, on average, more than 60 percent of corporate information has little to no current or future business value.

The need for organizations to incorporate policy surrounding the responsible retirement of data is critical when addressing big data concerns. Although data retention policies are not mandatory, the resulting efficiencies are undeniable, and such policies are prudent with regard to defensibility and adherence to regulation, compliance and the legal process as defined within the Federal Rules of Civil Procedure. It is for this reason that the widely accepted Electronic Discovery Reference Model begins with information management.

In summary, it is the quality of the data that determines its value for future business decisions. The quantity is of no use, for garbage remains garbage, regardless of how much of it you have.
