10 Best Practices in Big Data: Data Provenance

Published September 13, 2016, 6:30am EDT

With the growth of provenance metadata generated by large provenance graphs in big data applications, securing that metadata takes on great importance. To conclude this series, the new report from the Cloud Security Alliance, "100 Best Practices in Big Data Security and Privacy," provides the following best practices for data provenance security.

Develop infrastructure authentication protocol

Why? To prevent malicious parties from accessing data. Without infrastructure authentication, an entire group of users, including unauthorized individuals, can have access to data, and some of those parties may misuse the provenance data. For example, an adversary might release the provenance information, or the data itself, to the public, likely exposing sensitive information.

How? An authentication protocol is designed as a sequence of message exchanges between principals that allows possession of secrets to be recognized. Design consideration is needed in, for instance, whether the protocol features a trusted third party and which cryptographic and key management schemes are appropriate. If the number of nodes to be authorized is large, a public key infrastructure (PKI) can be used to authorize each worker, and a symmetric key can then be sent to each of them. The authorized nodes can communicate using the symmetric key, since its overhead is much smaller than that of public key methods.
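
Once the PKI has delivered a shared symmetric key to each authorized worker, the nodes can authenticate one another with cheap symmetric-key message exchanges. The sketch below (function names are illustrative, not from the report) shows a minimal HMAC challenge-response: a node proves it holds the shared key without ever transmitting it.

```python
import hashlib
import hmac
import secrets

def issue_challenge() -> bytes:
    """Coordinator sends a fresh random nonce to the node."""
    return secrets.token_bytes(16)

def node_response(shared_key: bytes, challenge: bytes) -> bytes:
    """Node proves possession of the shared key without revealing it."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()

def verify_node(shared_key: bytes, challenge: bytes, response: bytes) -> bool:
    """Coordinator recomputes the MAC and compares in constant time."""
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

key = secrets.token_bytes(32)          # distributed earlier via the PKI
challenge = issue_challenge()
assert verify_node(key, challenge, node_response(key, challenge))
assert not verify_node(key, challenge, node_response(b"wrong key", challenge))
```

The HMAC computation costs microseconds, illustrating why symmetric-key exchanges scale better than per-message public-key operations once keys are distributed.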

Ensure accurate, periodic status updates

Why? To collect data correctly. Advances in wireless technology have increased the number of mobile devices used to collect, transmit and store data. However, there may be malicious nodes and lazy nodes in the wireless environment. Malicious nodes are active attackers: when a malicious node receives data transmitted from a data owner, it actively tampers with the information, eavesdrops on the content, or drops the data package while sending fake data to the next relay node. Lazy nodes, on the other hand, simply choose not to store and forward the data, to avoid the energy cost of transmission. These manipulations of the collection and transmission of information can be captured as part of a provenance record, and the status of provenance records should be periodically updated.

How? Trust and reputation systems can be introduced into the wireless network to address lazy and malicious node issues. Reputation systems provide mechanisms to produce a metric encapsulating the reputation of each identity in the system. For example, if a node performs a malicious act or eavesdrops for sensitive information, the system can assign it a pre-defined negative rating. Likewise, the system can assign a pre-defined positive value to nodes with normal behavior, while lazy behavior is simply rated "zero." One node can also assign a trust value to another node based on their interactions. This trust information can be stored as part of the provenance record, which can be periodically updated to reflect its most recent status. The system should designate a pre-defined threshold for identifying valid nodes: if a node's trust value falls below that threshold, the node is recognized as invalid and is not used to transmit or collect information.
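
The rating scheme above can be sketched in a few lines. The specific values and threshold here are illustrative assumptions, not taken from the report: normal forwarding earns +1, lazy behavior earns 0, a malicious act earns -5, and a node must stay at or above the threshold to remain in the relay pool.

```python
# Illustrative rating values and validity threshold (assumed, not from the report).
MALICIOUS, LAZY, NORMAL = -5, 0, 1
THRESHOLD = 3

class Node:
    def __init__(self, name: str):
        self.name = name
        self.trust = 0          # reputation metric stored in the provenance record

    def rate(self, behavior: int) -> None:
        """Update the node's trust score after an observed interaction."""
        self.trust += behavior

    def is_valid(self) -> bool:
        """A node below the threshold is excluded from collection/relay duty."""
        return self.trust >= THRESHOLD

relay = Node("relay-1")
for _ in range(4):
    relay.rate(NORMAL)          # four successful forwards: trust = 4
assert relay.is_valid()

relay.rate(MALICIOUS)           # one tampering incident outweighs them: trust = -1
assert not relay.is_valid()
```

Note how a single malicious rating dominates several normal ones; weighting negative behavior heavily is a common design choice in reputation systems.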

Verify data integrity

Why? To ensure trust in data. In real-world environments, provenance data is typically stored on a personal laptop or in a remote database center. There may therefore be unevaluated risks of losing portions of the provenance data by accident, such as through unforeseen damage to a hard drive. In attacker-active environments, adversaries may also modify segments of the original data files in the database. It is therefore essential to detect whether user data has been tampered with; establishing data integrity is one of the most important markers of provenance security.

How? Checksums are among the least expensive methods of error detection, though since they are essentially a form of compaction, error masking can occur. Arithmetic checksums for error detection in serial transmission are an alternative to cyclic redundancy checks (CRC). Reed-Solomon (RS) codes are a special class of linear, non-binary block codes able to correct both errors and erasures; as maximum distance separable (MDS) codes, they achieve ideal protection against packet loss. Another efficient way to maintain the integrity of both data and provenance information is the use of digital signatures, which prevent data from being forged by adversaries. Digital signatures are the most common application of public key cryptography: a valid signature indicates that the data was created by the claimed sender and was not altered in transmission. This provides a cryptographic way to secure data and provenance integrity, and the verification cost remains reasonable even as the volume of data grows large. New, efficient digital signature schemes will likely be developed to ensure the integrity of information in big data scenarios.
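
As a minimal sketch of the checksum idea, a cryptographic digest can stand in for the simpler arithmetic checksums described above: store the digest alongside the record, recompute it on read, and any mismatch reveals tampering or corruption. (Unlike a digital signature, a bare digest does not authenticate who produced the data.)

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 digest used as an integrity checksum for a provenance record."""
    return hashlib.sha256(data).hexdigest()

record = b"sensor-42,2016-09-13T06:30,reading=17.3"
stored_digest = digest(record)     # kept alongside the record

# An intact copy verifies; a modified copy does not.
assert digest(record) == stored_digest
assert digest(b"sensor-42,2016-09-13T06:30,reading=99.9") != stored_digest
```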

Ensure consistency between provenance and data

Why? To ensure provenance information is trustworthy. Separating provenance from its data introduces problems and potential inconsistencies, so provenance should be maintained by the storage system. If the provenance does not exactly match the corresponding data, users cannot use it with confidence. Effective use of provenance establishes a record of process history for data objects. This historical information can be constructed as a chain, referred to as a provenance chain: a non-empty, time-ordered sequence of provenance records. For provenance stored in a database, the chain should be well organized so that provenance records remain consistent; without consistent record keeping, provenance information cannot be trusted.

How? Hash functions and hash tables can be used to guarantee consistency between the provenance and its data, with hash-map keys used to locate the data. Using the provenance data directly is not advised, because the information is likely enormous in size and its structure complicated. Before the provenance data is stored in the database, a hash function can generate the hash value of each selected data block; users can then combine the provenance information with the hashed provenance data to build the hash table. As the basic component of the provenance chain, a provenance record denotes a sequence of one or more actions performed on the original data. To keep a provenance record consistent, the cryptographic hash of both the newly modified record and the record's history chain are taken as input. The consistency of the provenance chain can then be verified by checking, at the beginning of each editing session, whether the current chain matches the provided hash value. As massive provenance data is generated in big data applications, users need highly efficient methods to maintain consistency between the provenance, its data, and the provenance chain itself.
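
The chained construction above can be sketched directly: each link's hash covers both the new record and the hash of the chain so far, so editing any earlier record invalidates every later link. The genesis value and record contents below are illustrative assumptions.

```python
import hashlib

def chain_head(records: list[bytes]) -> bytes:
    """Fold the records into a single head hash: h_i = H(h_{i-1} || record_i)."""
    h = b"\x00" * 32                       # assumed genesis value for an empty chain
    for rec in records:
        h = hashlib.sha256(h + rec).digest()
    return h

history = [b"created by alice", b"filtered by bob", b"aggregated by carol"]
head = chain_head(history)                 # stored as the chain's checkpoint

# At the start of an editing session, recompute and compare.
assert chain_head(history) == head                     # consistent chain verifies
tampered = [b"created by mallory"] + history[1:]
assert chain_head(tampered) != head                    # any edit breaks the chain
```

Verification touches each record once, so the cost grows linearly with chain length, which matters at big data scale.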

Implement effective encryption methods

Why? To maintain the security of provenance data. As cloud computing techniques evolve, users often outsource large amounts of scientific provenance data to the cloud for storage and computation. If the data provider transmits provenance data to the cloud servers in plaintext, the transmitted data flow can easily be eavesdropped on by an adversary. Additionally, the cloud server is always considered a third-party server that cannot be fully trusted.

How? One existing method of keeping data secure during transmission is to transform the original data into ciphertext. Before the provenance data is sent to the untrusted server, the data owner encrypts the information using a secret key. Anyone the transmitter does not trust lacks the secret key and therefore cannot decrypt the ciphertext to recover the original data; only authorized parties can decrypt it. Users can also employ encryption to securely outsource computation tasks to untrusted servers. Before sending a high-burden computation task to the cloud server, the data owner first "blinds" the original information using lightweight encryption technology, then outsources the encrypted information to the cloud server to handle the computation. Because the cloud server does not have the secret key, it cannot disclose the original provenance data. When the computation task is finished, the data owner uses the secret key to recover the data handled by the cloud server. In this way, the data owner can outsource the computational task while keeping both the task and the data confidential on the untrusted server.
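
As a toy illustration of blinding before outsourcing (NOT production cryptography; a real system would use an authenticated cipher such as AES-GCM), a one-time pad shows the essential flow: the cloud sees only ciphertext, and only the key holder recovers the plaintext.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (one-time-pad encrypt/decrypt)."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"provenance: derived from run #7"      # illustrative record
key = secrets.token_bytes(len(plaintext))           # secret, kept by the data owner

ciphertext = xor_bytes(plaintext, key)              # what the cloud server receives
recovered = xor_bytes(ciphertext, key)              # owner decrypts after computation
assert recovered == plaintext
```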

Use access control

Why? To prevent abuse and unauthorized disclosure of provenance records and data by malicious parties. One advantage of cloud computing is that the cloud provider can share data across different user environments; however, only authorized users should have access to shared data. The volume of stored provenance data in a cloud server is normally extensive, so under most circumstances the data owner may wish to restrict its accessibility. The provenance record may also contain private information, such as a user's personal profile or browsing log. Without appropriate access control, an adversary may access and misuse the provenance record or the data itself, or publicly disclose sensitive data, damaging the interests of the data owner.

How? Appropriate access control helps prevent illegal access and private data leakage by limiting the actions or operations a legitimate system user may perform. One popular technique is role-based access control (RBAC), in which provenance data owners store data in encrypted form and grant access only to users with specific roles. Attribute-based encryption (ABE) techniques can be implemented to achieve fine-grained access control. Ultimately, however, a ciphertext-policy attribute-based encryption (CP-ABE) scheme is the most advisable method for cloud-computing environments. In this scheme, each user in the system has a corresponding attribute set created according to their inherent properties, and authorized users are assigned private keys associated with their attribute sets. The data owner encrypts the provenance data under an access structure and assigns the encrypted data to those who possess the required privileges.
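
A minimal RBAC check looks like the sketch below. The role and permission names are illustrative assumptions; the point is that every request is mediated by a role-to-permission mapping rather than granted per user.

```python
# Assumed role-to-permission mapping for a provenance store (names illustrative).
ROLE_PERMISSIONS = {
    "auditor": {"read_provenance"},
    "curator": {"read_provenance", "append_record"},
    "admin":   {"read_provenance", "append_record", "grant_access"},
}

def is_allowed(role: str, operation: str) -> bool:
    """Grant an operation only if the requester's role covers it."""
    return operation in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("curator", "append_record")
assert not is_allowed("auditor", "append_record")
assert not is_allowed("guest", "read_provenance")   # unknown roles get nothing
```

CP-ABE generalizes this idea: instead of a server checking a table, the ciphertext itself enforces the policy, so even the storage provider cannot bypass it.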

Satisfy data independent persistence

Why? To preserve the indistinguishability of provenance data. When updates occur between two adjacent pieces of provenance data, the user should not be able to distinguish between those pieces; this property is referred to as "independence." For example, the provenances of two segments of derived data, such as simulation results, should not reveal differences between the data. Sometimes a user is granted partial privileges to access pieces of provenance data; if the data does not satisfy independence, that user can distinguish between two pieces of provenance data. Some segments of the provenance data may involve sensitive information that the data owner does not want the data consumer to access. Independence should also be achieved for different provenance records, because some users are only able to access part of the provenance chain.

How? Symmetric keys and distributed hash table (DHT) technology can be used to establish and maintain independence among different pieces of provenance data. The symmetric keys are selected randomly and independently to encrypt the different pieces; because the keys are independent, the encrypted pieces are independent as well. Moreover, different pieces of the same provenance data should not be stored successively, since they share the same patterns; DHTs can be used to retain the pieces in a distributed fashion. A hash function can also be used to give different provenance records independence: each record is hashed individually, and the hashed records are then reconstructed as a chain. Modifying an individual provenance record then affects only the corresponding hashed component and does not impact the other components in the hash chain. That is to say, different parts of the provenance records achieve independent persistence.
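
The per-record hashing step can be sketched as follows: each record is hashed on its own, so editing one record changes only its own component while the others are untouched. Record contents are illustrative.

```python
import hashlib

def hash_records(records: list[bytes]) -> list[bytes]:
    """Hash each provenance record independently (one component per record)."""
    return [hashlib.sha256(r).digest() for r in records]

records = [b"rec-0: collected", b"rec-1: cleaned", b"rec-2: published"]
before = hash_records(records)

records[1] = b"rec-1: re-cleaned"          # modify only the middle record
after = hash_records(records)

assert after[1] != before[1]                             # its component changed...
assert after[0] == before[0] and after[2] == before[2]   # ...the others did not
```

Contrast this with a single hash over the whole chain, where any edit forces every later component to be recomputed; hashing per record localizes the change.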

Utilize dynamic fine-grained access control

Why? To allow only authorized users to obtain certain data. Fine-grained data access control grants users (data consumers) access privileges determined by attributes. In most real-world cases, user-assigned privileges and attributes vary with time and location, which may need to be incorporated into access control decisions.

How? Using attribute-based encryption, fine-grained access control can be applied to encrypted provenance data. To achieve the dynamic property, users can introduce dynamic attributes and weighted attributes into the attribute-based encryption. A dynamic attribute is one that changes frequently, such as a location coordinate, while the other attributes are treated as weighted attributes. These attributes carry different weights according to their importance, as defined in the access control system. Every user in the system possesses a set of weighted attributes, and the data owner encrypts information for all users who hold a certain attribute set. A user's private key, however, embodies a specific weighted access structure, and a ciphertext with a given set of weighted attributes can be decrypted only when it satisfies that structure. The weight of an attribute can be increased or decreased to reflect the dynamic property.
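
The weighted-access-structure check, stripped of the cryptography, reduces to a threshold test over attribute weights. The attribute names, weights and thresholds below are illustrative assumptions; in a real scheme this comparison is enforced by the encryption itself, not by application code.

```python
# Assumed attribute weights defined by the access control system (illustrative).
ATTRIBUTE_WEIGHTS = {"physician": 5, "on_duty": 2, "in_hospital": 1, "intern": 1}

def satisfies(user_attributes: set[str], threshold: int) -> bool:
    """Decryption is allowed only if the total attribute weight meets the policy."""
    return sum(ATTRIBUTE_WEIGHTS.get(a, 0) for a in user_attributes) >= threshold

assert satisfies({"physician", "on_duty"}, threshold=6)        # 5 + 2 = 7
assert not satisfies({"intern", "in_hospital"}, threshold=6)   # 1 + 1 = 2

# A dynamic attribute such as location simply stops counting once it changes:
assert not satisfies({"physician"}, threshold=6)               # 5 < 6, off-site
```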

Implement scalable fine-grained access control

Why? To protect large-scale provenance data. A considerable amount of provenance data is stored and exchanged in databases. Database systems allow data consumers to access various types of provenance data in accordance with access policies designed by the data owner. An access policy must be scalable, however, to meet the ever-growing volume of provenance data and user activity within a group; if it is not, any future policy modifications will be difficult to implement.

How? Attribute-based encryption, introduced in the previous section, can be utilized to achieve fine-grained access control. To achieve scalability, data owners can introduce a semi-trusted proxy that exploits proxy re-encryption technology, allowing the proxy to transform a ciphertext to a new access structure. Since the proxy is "honest-but-curious," meaning it will try to gather as much private information as it can from whatever it holds, users cannot allow the proxy to fully decrypt the message. Ideally, users control the policy while the proxy is used to produce a partial ciphertext for the newly specified policy. When the data owner needs to modify the attributes in a ciphertext's access structure, the new policy is given to the proxy. The proxy first uses the previous partial attribute set to decrypt the partial ciphertext, then applies the newly specified policy to re-encrypt the partially decrypted ciphertext. Because the proxy does not have the full attribute set, it can only partially decrypt the message, never the entirety of the provenance data.
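
A toy analogue (NOT production cryptography, and far simpler than real proxy re-encryption over ABE) captures the key property: the owner hands the proxy a re-encryption token, and applying it converts the ciphertext from the old key to the new one without ever exposing the plaintext to the proxy. Here a one-time pad stands in for the cipher, so the token is simply the XOR of the two keys.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"provenance chain segment"               # illustrative record
old_key = secrets.token_bytes(len(plaintext))
new_key = secrets.token_bytes(len(plaintext))

ciphertext = xor_bytes(plaintext, old_key)            # under the old policy
re_key = xor_bytes(old_key, new_key)                  # token given to the proxy

# The proxy applies the token: (p ^ k_old) ^ (k_old ^ k_new) = p ^ k_new.
# It holds neither key alone, so it never sees the plaintext.
re_encrypted = xor_bytes(ciphertext, re_key)
assert xor_bytes(re_encrypted, new_key) == plaintext  # new policy holders decrypt
```

The owner does one cheap token computation per policy change, while the proxy does the bulk re-encryption work, which is exactly the scalability win described above.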

Establish flexible revocation mechanisms

Why? To prevent access by unauthorized entities. With such high volumes of provenance data stored in databases, maintaining effective access control is a constant challenge for data managers. Access to the data can easily be abused when access permission is not appropriately updated, even after the privilege has expired. For instance, a fired employee could still abuse his access privileges if they are not revoked in a timely manner. In some cases, data may only be valid during a particular time interval, so data managers also need an efficient revocation scheme for data that has expired.

How? The process of revoking user access permissions can be efficiently streamlined by having a central authority redistribute the key. Because many user keys are involved system-wide, it is best to establish a central authority for key management. When a user becomes invalid, the central authority can generate new decryption keys and distribute them among the remaining valid users; the ciphertext stored in the system is decrypted with the previous key and re-encrypted with the new key. In some cases this scheme is inefficient, because the computation and communication costs for the central authority are too high. Data managers can instead introduce a proxy, using the proxy re-encryption method presented in the previous section, to reduce the central authority's cost when the key is updated. To revoke access to provenance data efficiently, time-specific encryption (TSE) techniques can be applied. TSE is a type of public key-based encryption with additional functionality: the ciphertext specifies a time interval while the data consumer's private key is associated with a time instant, and the ciphertext can be decrypted successfully only when the time instant falls within the interval.
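
The TSE decryption rule reduces to an interval check, sketched below with illustrative dates and function names (this models only the policy logic, not an actual TSE construction): decryption succeeds only when the key's time instant falls inside the ciphertext's validity window.

```python
from datetime import datetime

def may_decrypt(key_instant: datetime,
                interval_start: datetime,
                interval_end: datetime) -> bool:
    """TSE policy check: the key's time instant must lie within the
    ciphertext's validity interval for decryption to succeed."""
    return interval_start <= key_instant <= interval_end

# Illustrative validity window attached to a ciphertext.
start = datetime(2016, 9, 1)
end = datetime(2016, 9, 30)

assert may_decrypt(datetime(2016, 9, 13), start, end)      # inside the window
assert not may_decrypt(datetime(2016, 10, 2), start, end)  # access has expired
```

Because expiry is embedded in the ciphertext itself, no key redistribution is needed when data simply ages out, which is the efficiency gain TSE offers over central re-keying.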