The information in a well-conceived data warehouse is like a must-have piece of furniture: once you commit to the set you want, it must be transported, stored and put on display for everyone to see as soon as possible. At the same time, the process of transporting, storing and displaying the furniture demands the utmost care to protect both the furniture and the surrounding infrastructure. The essence of the information and the furniture is the same – you want to both protect it and use it as soon as it arrives.

 

This article attempts to isolate the measures that can disrupt the balance of speed and protection during the transport, storage and use of information in a data warehouse environment. Figure 1 illustrates the typical data flow from the source through the data warehouse environment and highlights the data security threat level during transport (i.e., from the source), storage (i.e., the data warehouse, marts and cubes that may be points of exposure for unencrypted sensitive data) and use (i.e., business functions), assuming data is secured at the source. For the purposes of this article, threat level is assessed in terms of the ability to secure the data in a particular area of the environment versus the potential for theft.

 

 

A Prime Target

 

A mature data warehouse is a prime target for any data thief. It contains enterprise-wide data, integrated and modeled according to enterprise business definitions. The breadth of the data is immense: it spans industry-agnostic yet sensitive areas such as HR, finance and customers, as well as increasingly sensitive subject areas such as insurance claims. The depth of the data is equally profound; the warehouse holds transaction-level detail and historical snapshots of dimensional data, making it all the more attractive for a security breach.

 

Addressing data security is a must, but efforts to implement tight controls in the data warehouse can have a negative impact on performance. Over time, a mature data warehouse has been optimized and tuned for common usage patterns. Indexes, partitions, 64-bit processing and other performance-boosting measures help the data warehouse contend with growing data volumes and expanding usage. But just when data warehouse managers have optimized their environments for internally driven business requirements, headline-making data security breaches threaten to undermine those prior optimizations.

 

Fast, Secure Transport

 

When data is delivered to the data warehouse, there are typically several staging points en route to the target. Ultimately, each persistent staging point represents a security threat and a performance hit. In terms of performance, over-staging the data produces a high degree of input/output (I/O) activity, which ultimately delays the transformation and delivery of the data. If the staging area is a database, encrypting the data adds overhead to the overall processing, while persisting sensitive data in the clear exposes it in an area that is typically unmonitored. Thus, to avoid performance and security pitfalls in transit, avoid over-staging the data.

 

In addition to limiting I/O-intensive processing, limiting the data volumes in the extract, transform and load (ETL) pipeline constrains the amount of data exposed to risk and reduces the overall cycle time. To limit data volumes, change data capture (CDC) should be implemented for all source data acquisition. CDC optimizes the overall processing time for obvious reasons – less volume from the source means less transformation and loading is required. While optimizing performance, CDC also minimizes the amount of data that is staged along the way. Because only the minimum required data is persisted, an unwanted user who gains access to a staging area sees only a sparse slice of it. More importantly, not all of that data would relate to the rest: most slowly changing dimensions change infrequently, and facts change independently of the dimensions, so a thief would have to reach the data warehouse itself to reconstruct a complete transaction. If full refreshes are processed instead, data thieves have all the data they need at both the dimensional and transactional level.
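
For illustration, the sketch below shows a minimal timestamp-based CDC pull in Python. The table and column names (orders, last_modified) and the watermark handling are assumptions for the example; a production pipeline would more likely rely on log-based CDC or the source system's native change tables.

```python
import sqlite3  # stand-in for any source or warehouse connection


def extract_changes(conn: sqlite3.Connection, last_watermark: str):
    """Pull only rows changed since the previous load (timestamp-based CDC).

    Assumes a hypothetical source table `orders` with a `last_modified`
    column; real CDC would typically read the database log or change tables.
    """
    cur = conn.execute(
        "SELECT order_id, customer_id, amount, last_modified "
        "FROM orders WHERE last_modified > ? ORDER BY last_modified",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the high-water mark only after the downstream load succeeds.
    new_watermark = rows[-1][3] if rows else last_watermark
    return rows, new_watermark
```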

 

For data that is transported and archived on the file system, encryption should be applied at the network level (e.g., SFTP or FTPS) and at the storage level. A reasonable file retention policy should also be instituted so that staged source and load files do not linger on the ETL server. Change data capture helps minimize the cost that secure data transport and encryption add to file archiving and retention.
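
As a rough sketch of both practices, the following Python fragment ships a staged extract over SFTP (using the paramiko library, assumed to be available) and purges staged files older than a retention window. The host name, account, paths and seven-day window are placeholders, not recommendations.

```python
import time
from pathlib import Path

import paramiko  # assumes the paramiko SSH/SFTP library is installed

STAGING_DIR = Path("/data/etl/staging")   # hypothetical staging location
RETENTION_DAYS = 7                        # hypothetical retention window


def ship_file_securely(local_file: Path) -> None:
    """Send a staged extract over SFTP so it is encrypted in transit."""
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.connect("dw-landing.example.com", username="etl_svc",
                   key_filename="/home/etl_svc/.ssh/id_rsa")
    try:
        sftp = client.open_sftp()
        sftp.put(str(local_file), f"/landing/{local_file.name}")
        sftp.close()
    finally:
        client.close()


def purge_stale_staging_files() -> None:
    """Enforce the retention policy: staged files must not linger."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for f in STAGING_DIR.glob("*.dat"):
        if f.stat().st_mtime < cutoff:
            f.unlink()
```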

 

Fast, Secure Storage

 

Once data arrives at its destination in the data warehouse, there are several mechanisms to ensure that the sensitive information in the database is safe and secure. The most powerful mechanism is data encryption through the database. Encrypting the data in the database ensures that data throughout the information lifecycle – active, online archive, offline archive and tape – is protected.

 

This protection comes at a cost. The processing overhead of encrypting and decrypting a sensitive field such as a Social Security number can add roughly five percent. If the field is indexed, the index is built on the cipher text of the encrypted values, and any parameters in a SQL query that reference this indexed, encrypted field are themselves encrypted before the query is executed. There is also a storage overhead for the integrity check stored with each encrypted value: with Oracle’s transparent data encryption (TDE), encrypting a single column can add 33 to 48 bytes per row.1
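
To put that per-row figure in perspective, a quick back-of-the-envelope calculation (the two-billion-row table is purely hypothetical) shows how the cited 33-to-48-byte range translates into additional storage:

```python
def tde_overhead_gb(rows: int, bytes_per_row: int) -> float:
    """Estimate the added storage for one encrypted column, in gigabytes."""
    return rows * bytes_per_row / 1024**3


# Hypothetical 2-billion-row fact table, using the cited 33-48 byte range.
for per_row in (33, 48):
    print(f"{per_row} bytes/row -> "
          f"{tde_overhead_gb(2_000_000_000, per_row):.1f} GB")
# Roughly 61 GB to 89 GB of additional storage before any compression.
```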

 

New database features, such as compression, and data warehouse appliances that offer minimally invasive encryption can offset the overhead associated with encryption. However, applying encryption to an existing data warehouse without revisiting performance benchmarks could have severe impacts. The associated overhead could require hardware upgrades, database upgrades or a complete replatforming to meet governance and performance objectives. In short, implementing encryption in the data warehouse has to be a managed and budgeted effort. Thus, the application of encryption in the data warehouse should be judicious and selective – otherwise, the result is the equivalent of a plastic couch cover over the data warehouse.

 

Database encryption is not a complete solution. Vulnerabilities to attack can still surface through the network layer. Sensitive data must be encrypted over networks (specifically WPA or WPA2 encryption over wireless networks). Otherwise, efforts to store encrypted data will be undermined, particularly at the transaction's point of origin. The TJX credit card breach, in which 94 million debit and credit cards were compromised, resulted from wireless networks that did not comply with the WPA encryption standards mandated by PCI compliance.

 

In addition to encryption efforts, data warehouse managers should align enterprise information lifecycle management and data governance policies to help secure data and improve performance. A well-formed set of policies to archive and purge information in the data warehouse does two things: it limits data volumes, thereby improving performance, and it limits the volume of sensitive information retained, in line with data standards. While it may seem counter to the core principles of a data warehouse, rolling older data into secure online archive storage reduces the volume of data kept online for analysis and, much like CDC, improves performance in the data warehouse. Secure archive storage also helps bring organizations into compliance with data security standards such as PCI DSS requirement 3 (specifically sections 3.1 and 3.2.1), which mandates strong data retention policies and limits on the storage of cardholder data.2
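
A minimal sketch of such a policy, assuming hypothetical sales_fact and sales_fact_archive tables and a three-year online retention window, might look like the following; a production job would add batching, logging and referential checks.

```python
import sqlite3
from datetime import date, timedelta

ONLINE_RETENTION_DAYS = 3 * 365  # hypothetical policy: three years online


def archive_and_purge(conn: sqlite3.Connection) -> None:
    """Move aged transactions to a secure archive table, then purge them.

    Table and column names (`sales_fact`, `sales_fact_archive`,
    `transaction_date`) are placeholders for this illustration.
    """
    cutoff = (date.today() - timedelta(days=ONLINE_RETENTION_DAYS)).isoformat()
    with conn:  # single transaction: archive and purge together
        conn.execute(
            "INSERT INTO sales_fact_archive SELECT * FROM sales_fact "
            "WHERE transaction_date < ?", (cutoff,))
        conn.execute(
            "DELETE FROM sales_fact WHERE transaction_date < ?", (cutoff,))
```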

 

Fast, Secure Viewing

 

The impact on performance and security when users view information managed in a data warehouse is typically aligned with the business functions and use of the information. For instance, canned reports or guided analysis usually involve limited data volumes and well-indexed access paths to the information in question. The vulnerability lies in the nature of the data – accessing medical claims or driver incidents for a policyholder can yield sensitive information at a micro level, but the result set is bound to a finite set of parameters. Consequently, the impact on performance and the risk of a large-scale data security breach are both limited.

 

In environments that require heavier what-if or unguided analysis, the potential for a front-page security breach is more concerning. In this type of environment, strict usage monitoring should be enforced. Usage monitoring is always a good idea for understanding data access paths and can uncover new indexing or partitioning candidates; it can also help detect unwanted data access. Hence, usage monitoring is a practical means to improve performance and keep the wolves at bay. Still, usage monitoring is simply a camera over an ATM – it is not a security system. Strong user authorization and centralized identity management provide the proactive, preventive measures that ensure only authorized users gain access to the data.
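
As an illustration of the monitoring side, the sketch below scans a query audit log and flags access patterns worth a closer look. The CSV log layout, the one-million-row threshold and the business-hours window are all assumptions for the example; real deployments typically rely on the database's native auditing or a dedicated monitoring product.

```python
import csv
from datetime import datetime

ROW_THRESHOLD = 1_000_000      # hypothetical: flag unusually large pulls
BUSINESS_HOURS = range(7, 20)  # hypothetical: 07:00-19:59 local time


def flag_suspicious_queries(audit_log_path: str):
    """Scan a query audit log (assumed CSV with columns: user, timestamp,
    rows_returned, sql) and flag access patterns worth a closer look."""
    flagged = []
    with open(audit_log_path, newline="") as f:
        for rec in csv.DictReader(f):
            ts = datetime.fromisoformat(rec["timestamp"])
            rows = int(rec["rows_returned"])
            if rows > ROW_THRESHOLD or ts.hour not in BUSINESS_HOURS:
                flagged.append((rec["user"], ts, rows))
    return flagged
```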

 

Cubes and spreadmarts are typically a component of any data warehouse environment and provide users with easy, fast access to preaggregated information. Once the information in the data warehouse environment is distributed across multiple cubes and spreadmarts, there are limited measures available to secure the data. Most multidimensional online analytical processing (MOLAP) or Office platforms do not support encryption. With most business intelligence (BI) tools, the data is encrypted at the database level but not en route. Users sign on to the application, and the data refreshes based on individual or group security rights. In essence, the encryption is performed not in the BI layer but in the database layer, and user authorization and authentication policies govern access to data.

 

Cubes tend to store highly summarized data, but the risk of exposing sensitive data elements like Social Security numbers or credit card numbers still exists. Many drill-down reports expose very detailed data, and business users commonly export that data to Excel. Once a spreadmart is initialized with sensitive data, there is no mechanism to encrypt the exported data in Excel. As a result, data warehouse owners must inventory all cubes and spreadmarts that users access directly and determine whether any sensitive data resides in these repositories. If sensitive data is accessible inside these portable data stores, managers should either remove the sensitive data from the cubes or centralize its storage so that access can be tightly regulated.
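
One lightweight way to support that inventory, sketched below under the assumption that cube and spreadmart contents can be exported to CSV, is to scan exports for patterns that look like Social Security or card numbers. The regular expressions are deliberately simple; a real scanner would add validation such as Luhn checks or lean on commercial data loss prevention tooling.

```python
import csv
import re
from pathlib import Path

# Simple patterns for illustration only; production scanners use stronger
# validation (e.g., Luhn checks for card numbers) and dedicated DLP tools.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){15,16}\b")


def scan_export_for_sensitive_data(path: Path) -> bool:
    """Return True if a spreadmart or cube export (CSV assumed) appears to
    contain SSNs or card numbers and should be remediated."""
    with path.open(newline="") as f:
        for row in csv.reader(f):
            for cell in row:
                if SSN_RE.search(cell) or CARD_RE.search(cell):
                    return True
    return False
```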

 

The long-term solution that strikes the balance between security and performance in the data warehouse may be virtualization. The power of virtualization (i.e., virtual servers and virtual desktops) in any environment, and particularly in a data warehouse, lies in centralized control of servers and storage, which aggregates computing power and distributes it to applications and users. Complete virtualization is a radical shift from existing data warehouse environments and requires a complete overhaul at all levels of the data warehouse – network, storage, software and even how users access information. For managers in the early stages of a data warehouse roadmap, virtualization should be explored as a component of the deployment. Managers who are beginning to address data security in the warehouse should consider the potential role virtualization can play in the long term, but should weigh this potential against the immediate reality that most application vendors do not support virtualized environments.

 

Strategically, virtualization or a data warehouse appliance may deliver the perfect balance of performance and data security. Practically speaking, it is unlikely that the data warehouse can lead IT toward virtualization if it is not already part of the overall infrastructure plan. Furthermore, the rewards of such an architecture may be based more on potential than on track record. Until a strategic infrastructure that ensures both security and performance can be planned, managers should consider more tactical measures: limiting data volumes and processing time, selectively encrypting sensitive attributes in persistent storage, and strengthening the definition and enforcement of information lifecycle management policies.

 

References:

  1. PCI Security Standards Council.
