Does Hadoop Mean the End of the Data Model?
The development of Hadoop and the Hadoop Distributed File System has made it possible to load and process large files of data in a highly scalable, fault tolerant environment. The data loaded into the HDFS can be queried using a batch process provided by MapReduce and other cluster computing frameworks, which will parallelize jobs for developers by distributing processing to the data located on a pool of servers that can be easily scaled.
The Hadoop environment makes it easy to load data into the HDFS without needing to define the structure of the data beforehand. The usage of the Hadoop environment naturally raises the question:
Does Hadoop mean that data models are no longer required?
The answer to this question lies in the purpose of a data model and the needs of this type of environment.
The Hadoop Method
Assuming data is processed using MapReduce, the layout of the data that was loaded into HDFS is defined within the MapReduce program. This is referred to as “late-binding” of the data layout to the data content. This approach frees the developer from the tyranny of the early-binding method, which requires that a data layout be defined first, often in the form of a data model. Early binding is the approach used by relational databases, which also support constraints for enforcing business rules and validating data content.
The natural result of separating the data content from the data structure is that the MapReduce program becomes the place where the two are linked. Depending on the data processing needs, this may or may not be a complete data structure definition. In addition, each developer will define this mapping in a slightly different way, which results in partial views that make a unified definition hard to assemble.
The late-binding of data content to the data structure essentially places the developer as the middleman between the data and the data consumer since most data consumers are not MapReduce trained. Hadoop requires that the developer know how the data is laid out in the file, its format, whether it is compressed or not, and the name of the file(s), every time a new MapReduce program is developed. This late-binding approach requires the same work be repeated over and over again.
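The late-binding idea can be sketched in a few lines of plain Python that imitate a map step and a reduce step. This is only an illustration, not Hadoop code: the sample lines, the field names in FIELDS and the two helper functions are assumptions made up for this example. What it shows is that the layout lives in the program rather than with the data.

```python
import csv
import io

# Hypothetical raw file content as it might sit in HDFS: no header, no schema.
RAW_LINES = [
    "1001,2014-03-01,EMEA,250.00",
    "1002,2014-03-01,APAC,75.50",
    "1001,2014-03-02,EMEA,100.00",
]

# Late binding: the layout is declared here, inside the program.
# These column names and positions are this developer's assumptions.
FIELDS = ["customer_number", "order_date", "region", "amount"]


def map_line(line):
    """Map step: bind the layout to one raw line and emit (key, value)."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, values))
    return record["customer_number"], float(record["amount"])


def reduce_totals(pairs):
    """Reduce step: sum the emitted amounts per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals


totals = reduce_totals(map_line(line) for line in RAW_LINES)
print(totals)
```

A second developer who needs only the region and the date would write a different FIELDS list, which is how the partial, slightly differing views described above come about.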
Additional shortcomings of the late-binding approach include:
- Data storage changes require changes in Pig and MapReduce jobs.
- Data storage changes force a painful transition process to take advantage of any improvements.
- Different tools do not share the same definition of data types, making the sharing of data error-prone.
- The file system can become a dumping ground, resulting in data being difficult to manage.
The HCatalog Method
Hadoop and the HDFS provide the means for users to store their data content. They give users the ability to “load and go” when they are processing a single file. HDFS, however, does not help them determine a file’s layout or content. A MapReduce job is tightly coupled to the data layout, and any change in location, data type or compression can impact the job specification. This forces the HDFS user to contact the producer of the file and get the revised file definition or “schema.” If these files require more sophisticated processing, development becomes more time consuming. Further, different tools each have a slightly different notion of data types, making it a challenge to interpret the data correctly.
The fragmented nature of the Hadoop file system and the need to be a developer in order to perform queries against the data led to the development of HCatalog, a table and storage management layer for Hadoop. HCatalog provides an environment where:
- A central location is used to define a data structure for data content stored in the HDFS.
- A data file layout definition is maintained and can be kept current.
- The data producer does not need to know where or how the data is stored.
- A schema can be shared by frameworks such as Pig, Hive and MapReduce.
- Notification of data availability is possible.
- Data producers can change the data layout without affecting data consumers using the old layout.
- Old and new data layouts can be processed by different processes.
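The points in this list can be made concrete with a toy, in-memory catalog. The sketch below is not HCatalog's API; the ToyCatalog class, its methods and the sample table are hypothetical and exist only to show the principle: the layout is registered once in a central place, consumers look it up instead of hard-coding it, and an old layout stays readable after a new one is added.

```python
class ToyCatalog:
    """In-memory stand-in for a central table-definition store (not HCatalog's API)."""

    def __init__(self):
        self._tables = {}  # table name -> list of layout versions

    def register(self, table, columns, location):
        """Add a new layout version for a table and return its version number."""
        versions = self._tables.setdefault(table, [])
        versions.append({"columns": list(columns), "location": location})
        return len(versions)

    def schema(self, table, version=None):
        """Return the requested layout version, or the latest one by default."""
        versions = self._tables[table]
        return versions[-1] if version is None else versions[version - 1]


def read_rows(catalog, table, lines, version=None):
    """Consumer side: the layout comes from the catalog, not from the program."""
    columns = [name for name, _type in catalog.schema(table, version)["columns"]]
    return [dict(zip(columns, line.split(","))) for line in lines]


catalog = ToyCatalog()
v1 = catalog.register(
    "orders",
    [("customer_number", "string"), ("amount", "double")],
    "/data/orders/v1",  # hypothetical path
)
v2 = catalog.register(
    "orders",
    [("customer_number", "string"), ("amount", "double"), ("region", "string")],
    "/data/orders/v2",  # hypothetical path
)

# A consumer of old files pins version 1; a new consumer takes the latest layout.
old_rows = read_rows(catalog, "orders", ["1001,250.00"], version=v1)
new_rows = read_rows(catalog, "orders", ["1002,75.50,APAC"])
print(old_rows, new_rows)
```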
The provision of the Hive Query Language has made the data content more accessible for a data consumer who understands SQL. Having these data structure definitions in HCatalog is certainly a step forward, but is this the same as a data structure data model?
Data Structure Data Model
Consider the following working definition of data modeling as a guide.
Data modeling is a process used to analyze, define and identify the relationship between data objects needed to meet business requirements. A data model is the description of these data objects, properties and relationships that facilitates communication between the business people defining the requirements and the technical people defining the design in response to those requirements.
With that definition in mind, questions to consider are:
- Does the HCatalog repository qualify as a data model?
- What value can a data model provide to big data?
- What is included in a Data Structure Model?
The creation of an HCatalog repository does not fully qualify as a data model because it does not define how one entity relates to another. Without these relationship definitions and the analysis that is part of a data model definition, the overlap between data files is unlikely to be confirmed. The result of the analysis required to produce a data model would determine, for example, if a Customer Number is the same as a Customer Identifier or a Customer Code in different data sources.
This missing step results in the HCatalog repository remaining an inventory of data structure definitions.
Data Structure Model
The Data Structure Model is a valuable addition to an enterprise data architecture methodology, as it provides a link between the enterprise data model and the HCatalog repository. The DSM would not be a logical or physical data model as is normally defined in an enterprise data architecture. Its purpose is to source the data structures contained in the HCatalog repository and be the link to the enterprise data model. The DSM would have the following characteristics:
- The DSM would be used to generate the HCatalog file definitions.
- The DSM would provide data source definitions to an enterprise data model.
- The DSM would contain some data source definitions used only by HCatalog (e.g., log files).
- The DSM would use “original” names.
- The DSM would “relate” data sources to show hierarchy and name overlaps using Role names.
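A minimal sketch of these characteristics, under stated assumptions, might look as follows. The Column and DataSource classes, the role attribute and the sample sources are hypothetical; the generated statement uses Hive-style DDL, which is the kind of table definition HCatalog works with, but the exact statements a real DSM tool would emit are not specified in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str        # "original" name as it appears in the source
    hive_type: str   # Hive column type, e.g. STRING or DOUBLE
    role: str = ""   # hypothetical role name linking equivalent columns


@dataclass
class DataSource:
    name: str
    location: str
    columns: list = field(default_factory=list)

    def to_ddl(self):
        """Generate a Hive-style table definition for this data source."""
        cols = ",\n  ".join(f"{c.name} {c.hive_type}" for c in self.columns)
        return (
            f"CREATE EXTERNAL TABLE {self.name} (\n  {cols}\n)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
            f"LOCATION '{self.location}'"
        )


def overlaps(sources):
    """Group columns from different sources that share a role name."""
    by_role = {}
    for source in sources:
        for column in source.columns:
            if column.role:
                by_role.setdefault(column.role, []).append(
                    f"{source.name}.{column.name}"
                )
    return {role: cols for role, cols in by_role.items() if len(cols) > 1}


orders = DataSource(
    "orders",
    "/data/orders",  # hypothetical path
    [Column("cust_no", "STRING", role="customer"), Column("amount", "DOUBLE")],
)
web_log = DataSource(
    "web_log",
    "/data/web_log",  # hypothetical path
    [Column("customer_id", "STRING", role="customer"), Column("url", "STRING")],
)

ddl = orders.to_ddl()
shared = overlaps([orders, web_log])
print(ddl)
print(shared)
```

The original column names are kept as they are, and the role name is what records that cust_no and customer_id refer to the same thing.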
Data Structure Model Role
The Data Structure Model serves as the link between what appear to be two disparate worlds: the big data world that is faced with high velocity, volume, variety and eventual consistency of data, and the information systems world that focuses on constraint, connection, content and ACID compliance. The data source definition would first be added to the DSM, which would generate the statements required to define it in the HCatalog repository. The data content needed for analysis would be loaded into Hadoop (HDFS), and processing can be initiated when HCatalog has been updated.
The same data structure definition can be provided to an enterprise data model as a data source definition. This data source can be analyzed and mapped to the appropriate enterprise data attributes. It is very likely in an enterprise data architecture environment that this source mapping has already been done. It is not required that every file in the DSM be provided to the EDM. Some data structures will only ever be processed in the Hadoop environment.
The mapping of data sources to EDM data attributes will now provide a means to show these relationships and provide valuable metadata for ETL processing.
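As a rough illustration of how such a mapping could serve as ETL metadata, the sketch below keeps the source-to-EDM mapping in a plain dictionary. The source names, column names and EDM attribute names are all invented for this example, and a real mapping would live in a metadata repository rather than in code.

```python
# Hypothetical source-to-EDM mapping: (source, source column) -> EDM attribute.
SOURCE_TO_EDM = {
    ("orders", "cust_no"): "Customer.CustomerIdentifier",
    ("orders", "amount"): "Order.OrderAmount",
    ("web_log", "customer_id"): "Customer.CustomerIdentifier",
}


def to_edm(source, record):
    """Rename a source record's fields to EDM attribute names; skip unmapped fields."""
    return {
        SOURCE_TO_EDM[(source, column)]: value
        for column, value in record.items()
        if (source, column) in SOURCE_TO_EDM
    }


def sources_for(attribute):
    """Lineage lookup: which source columns feed a given EDM attribute?"""
    return sorted(
        f"{source}.{column}"
        for (source, column), target in SOURCE_TO_EDM.items()
        if target == attribute
    )


order = to_edm("orders", {"cust_no": "1001", "amount": "250.00"})
hit = to_edm("web_log", {"customer_id": "1001", "url": "/home"})
lineage = sources_for("Customer.CustomerIdentifier")
print(order, hit, lineage)
```

The url field is dropped because it has no EDM mapping, which matches the point that some data structures are only ever processed in the Hadoop environment.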
This is illustrated in the diagram.
The use of Hadoop and distributed programming frameworks, such as MapReduce, has provided a valuable tool for analyzing large volumes and varieties of data content, especially unstructured data. The addition of HCatalog, Hive, Spark and Impala has opened up this environment to a broader base of data consumers, yet these improvements did not provide the means to reflect the relationships between data sources or to relate them to enterprise metadata.
The use of a Data Structure Model provides a means to capture data source definitions, capture the appropriate metadata and define any relationships that exist between the data sources. The Data Structure Model can provide data source definitions to both the big data and enterprise data architecture worlds.
The use of Hadoop has not brought an end to the need for data models but rather requires them to provide a connection to the enterprise data architecture environment.