Overcoming challenges to securing the modern data infrastructure
The recent evolution from storing data in a data warehouse to using a hybrid infrastructure of on-premises and cloud data lakes has enabled tremendous agility and scale, but it has also created a security and privacy risk that current strategies don’t address.
Organizations concerned about the quality of their data, protecting their brand and intellectual property, and complying with evolving privacy regulations must understand how the modern infrastructure has broken the relationship between data and metadata, and how that break affects the quality and security of their data.
First, a couple of basic definitions. Data refers to the structured or unstructured information (transactions, PDF files, etc.) stored in a database or data warehouse. Metadata refers to information about the data, such as the structure of the data (tables, views, etc.), the source of the data (e.g., where it was produced), user access policies, and whether the data contains personally identifiable information (PII—credit card numbers, addresses, etc.).
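To make the distinction concrete, here is a toy record and its metadata side by side (all names are invented for illustration):

```python
# The data: a single stored record.
data = {"name": "Ada Lovelace", "card": "4111111111111234"}

# The metadata: information *about* that record -- its structure, its
# source, who may access it, and which fields contain PII.
metadata = {
    "table": "customers",
    "source": "checkout_service",
    "access": ["fraud_analyst"],
    "columns": {
        "name": {"type": "string", "pii": True},
        "card": {"type": "string", "pii": True},
    },
}
```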
The Old Data Warehouse Model
Prior to the rise of the cloud and the advent of big data initiatives, data was typically stored in a data warehouse. Data and the metadata catalog were managed by the same system and were essentially inseparable. No data could be added to the data warehouse without the corresponding metadata, ensuring strict enforcement.
In database parlance, this is “schema on write.” That is, the metadata was defined for the data when it was first added to the database. This helped ensure data quality because every application written to access the data had the exact same view and understanding of the data. It made managing security and privacy easier because, based on the metadata, the data warehouse could control who had access to what information.
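The "schema on write" contract can be seen in any relational store: the schema is declared before data arrives, and non-conforming writes are rejected. A minimal sketch using SQLite (table and column names are invented for illustration):

```python
import sqlite3

# Declare the schema up front -- no row can be written without conforming to it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        txn_id     INTEGER PRIMARY KEY,
        amount_usd REAL NOT NULL,
        card_last4 TEXT NOT NULL CHECK (length(card_last4) = 4)
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO transactions VALUES (1, 19.99, '4242')")

# A non-conforming row (missing the required amount) is rejected at write
# time -- exactly the enforcement the data warehouse model provides.
try:
    conn.execute("INSERT INTO transactions VALUES (2, NULL, '12')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

Because enforcement happens once, at the point of write, every downstream application can trust that stored rows match the declared schema.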
The database also managed transformations, such as masking credit card numbers. This consistency made it easier to build new applications and tools and manage performance. Data consumers also had high confidence in the quality and accessibility of the data.
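A warehouse-managed transformation such as credit card masking amounts to something like the following (a hypothetical helper, not any vendor's API):

```python
def mask_card(card_number: str) -> str:
    """Replace all but the last four digits of a card number with '*'."""
    digits = card_number.replace("-", "").replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_card("4111-1111-1111-1234"))  # ************1234
```

The point is that the warehouse applies this once, centrally, so every consumer sees the same masked value.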
The Rise of Hadoop and the Data Lake
The challenge with the data warehouse model arose from the need for greater speed and agility. For example, if a mobile phone application with hundreds of thousands or millions of users needs to constantly feed changes to the database, it may be impossible to enforce strict metadata at scale without a cost-prohibitive investment in infrastructure.
Hadoop was developed to solve this problem. With Hadoop, data is added to the data lake without metadata. It is simply “raw data.” Optional metadata can then be applied to the data by an application when it reads the data. This is called “schema on read.” The benefit of this approach is that data can always be added without metadata enforcement at write time, making the system very fast, highly scalable, and cost-effective.
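A minimal sketch of “schema on read,” assuming JSON-lines records landing in the lake (field names are invented): structure is imposed only when an application parses the raw data.

```python
import json

# Raw records arrive with no enforced structure -- note the second record
# uses different field names and types than the first.
raw_records = [
    '{"txn_id": 1, "amount_usd": 19.99}',
    '{"txn_id": "2", "amount": "5.00"}',
]

def read_with_schema(line: str) -> dict:
    """Apply a schema at read time; every application must repeat this logic."""
    rec = json.loads(line)
    return {
        "txn_id": int(rec["txn_id"]),
        "amount_usd": float(rec.get("amount_usd", rec.get("amount", 0.0))),
    }

parsed = [read_with_schema(r) for r in raw_records]
```

The write path stays fast because nothing is validated on ingest; the cost is that every reader must carry its own copy of this normalization logic.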
The problem with this approach is that it becomes much harder to ensure quality, build new applications and tools, manage performance, and protect and secure the data.
Since metadata enforcement takes place when an application reads the data, every application reading the data must enforce the same metadata in the same way; otherwise, inconsistencies arise. This makes it much harder and more time-consuming to write applications and to ensure consistency among them, especially when different teams build applications for different use cases.
For example, one application may need the real last four digits of a credit card number to verify a customer’s identity, while another may require the full number to be anonymized via tokenization so records can be joined across datasets for a customer-360 application. This kind of inconsistency across applications makes security management very difficult.
If you’ve heard database administrators refer to their Hadoop data lake as a “data swamp,” this is why.
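The divergence described above can be sketched as two readers applying different rules to the same field (SHA-256 stands in here for a real tokenization service; all names are illustrative):

```python
import hashlib

CARD = "4111111111111234"

def identity_check_view(card: str) -> str:
    """Application A: identity verification needs the real last four digits."""
    return card[-4:]

def customer_360_view(card: str) -> str:
    """Application B: analytics needs a stable anonymized token instead."""
    return "tok_" + hashlib.sha256(card.encode()).hexdigest()[:16]

# The same column now has two incompatible representations, each enforced
# by application code rather than by the platform.
print(identity_check_view(CARD))  # 1234
print(customer_360_view(CARD))
```

Neither view is wrong for its own use case; the problem is that no central system guarantees they stay consistent with policy.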
Attempts to Solve the Metadata Challenge
Companies generally attempt to solve the metadata challenge in one of two ways. The first is to use a new application layer, rather than the database layer, to keep metadata and data in sync. This new application layer is responsible for ensuring the data and metadata are always combined and used properly.
While this can work, it is expensive to build and maintain, and the onus shifts to each application, so inconsistency continues to be an issue.
The second approach is to make multiple physical copies of the data and embed a variant of the metadata with each copy. This approach eliminates the disconnect between data and metadata, but it comes at a big cost to flexibility. For the purposes of security and privacy, the proliferation of copies of the data (for example, an additional copy for each privacy level, which can be different for each data consumer) can be prohibitively complex and expensive.
Getting It Right: Enforcing Data Security and Privacy Consistently Across All Applications
To ensure privacy and security, the metadata and data must be managed in sync by the same system, no matter which application is accessing the data. The system must be able to enforce “schema on read” and manage access controls and transformations.
This means that instead of sitting in the “data flow” between each application and Hadoop, the system needs to sit in the “data plane,” where it is automatically invoked by all applications and automatically manages access and transformations, so developers no longer need to handle this themselves.
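A toy sketch of the idea, assuming invented roles and policies: one enforcement layer mediates every read, so masking and access control live in the data plane rather than in each application.

```python
# Policies are metadata, managed centrally alongside the data.
POLICIES = {
    "fraud_analyst": {"card": "last4"},  # may see the real last four digits
    "marketing_app": {"card": "token"},  # receives an anonymized token
}

def read_record(role: str, record: dict) -> dict:
    """Single choke point: every application's reads pass through here."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"no access policy for role {role!r}")
    out = dict(record)
    if policy["card"] == "last4":
        out["card"] = "*" * (len(record["card"]) - 4) + record["card"][-4:]
    else:  # "token"
        out["card"] = "tok_%08x" % (hash(record["card"]) & 0xFFFFFFFF)
    return out

record = {"txn_id": 1, "card": "4111111111111234"}
print(read_record("fraud_analyst", record)["card"])  # ************1234
```

Because the policy lookup and the transformation happen in one place, adding a role or tightening a rule changes one table, not every application.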
This is where the industry must focus its attention, and this is what organizations must demand of their vendors.
Because of the fundamental split between data and metadata in the modern infrastructure, enterprise developers have been forced to try to handle the challenges of data quality and security in the individual applications they write. This is an extremely complex and laborious process that has met neither goal.
Current attempts to solve this via solutions that sit between an application and Hadoop have improved developer usability and productivity, and they can ensure data consistency, but they are wholly inadequate for security. Only a new approach that sits in the “data plane” and that enforces metadata creation on write, manages user access, and performs data transformations will enable organizations to ensure data quality, protect their brands, secure their intellectual property, and comply with evolving privacy regulations.