In last month's column, we started looking at abstracting the ways that we exploit metadata for the purposes of access and data manipulation. This month I want to focus on the basic class concept: a single representation of the binding of values to attributes associated with a single data instance within a data set.
Everyone associated with the management of data sets becomes partial to his or her own milieu - the database analyst thinks about records and tables, the Web services architect thinks about XML documents, the COBOL programmer thinks about VSAM files and the object-oriented programmer thinks about classes and objects. Yet in most situations, each individual is thinking about a particular representation of a data instance that is relevant to his or her specific operational context. The basic notion behind each representation is a single entity with a collection of attributes, each bound to a specific value. In turn, many of these entities are viewed together as a data set.
Different object classes have different sets of attributes, but conceptually the abstract view of a data instance is a single object with an enumeration of (attribute, value) pairs. For each class of data instances, the names of the attributes remain constant, although the values bound to them are likely (and in some cases required) to distinguish each instance from all other instances in the set. This means that the list of data attributes is static for the entire class of instances, but the bound values are specific to each instance.
In addition, one might want to consolidate data instances that are represented in different physical formats. For example, one might want to compare rows in a flat file to records in a relational database management system (RDBMS) for consistency, as a prelude to loading the flat file into the database. Fortunately, the enumeration of (attribute, value) pairs should be able to facilitate this comparison for most physical representations.
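As a minimal sketch of that comparison, the Python below reads rows from a delimited flat file and records from a relational table into plain dictionaries of (attribute, value) pairs, then reports which attributes disagree. The function names, the sample `customer` table and the decision to compare every value as a string are assumptions made for illustration, not part of any particular product.

```python
import csv
import io
import sqlite3


def rows_from_flat_file(text):
    """Read delimited text into a list of attribute -> value dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_database(conn, query):
    """Run a query and return each record as an attribute -> value dictionary."""
    cursor = conn.execute(query)
    names = [column[0] for column in cursor.description]
    # Assumption: values are compared as strings, because the flat file
    # carries no type information. NULLs would need separate handling.
    return [dict(zip(names, (str(v) for v in record))) for record in cursor]


def differing_attributes(a, b):
    """Return the sorted names of attributes whose values differ."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Hypothetical sample data standing in for a real file and a real database.
flat_file = "id,name,city\n1,Ann,Boston\n2,Bob,Denver\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "Boston"), (2, "Bob", "Dallas")],
)
conn.commit()

file_rows = rows_from_flat_file(flat_file)
db_rows = rows_from_database(conn, "SELECT id, name, city FROM customer ORDER BY id")

for file_row, db_row in zip(file_rows, db_rows):
    print(file_row["id"], differing_attributes(file_row, db_row))
```

Once both sources are reduced to the same (attribute, value) form, the comparison itself no longer needs to know where either instance came from.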
Figure 1 shows the methods (or functions) to which we would subject our data instances. It is relatively easy to satisfy the need to manage (attribute, value) pairs in the ways that we have examined. We need an object that allows one to quickly find the value of any named attribute, and a hash table provides this flexibility. For the non-programmers, a hash table is what we could call an associative directory consisting of keys (used for indexing) and values associated with those keys. The hash table is indexed directly by the key and gives efficient access, in constant time on average, to the value associated with that key. Hash tables meet our needs because each attribute name becomes a key into the hash table and each attribute value is looked up via the attribute's name.
Figure 1: Data Instance Methods
Luckily, in many object-oriented programming environments, hash table classes already exist, making the implementation relatively straightforward - we would derive a new DataInstance class from a HashTable class, which transparently provides the functionality we need. The complications arise from the different sources of instance data, from how data is transformed from the original source and from how data instances are transformed into a target representation. For example, while translating a record from a database or a flat file into a data instance seems trivial, transforming an XML document with a deep hierarchical structure into the data instance paradigm is a bit more complex.
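One way to sketch this derivation in Python is to subclass the built-in `dict`, which plays the role of the HashTable class. The two constructor methods, `from_record` and `from_xml`, are hypothetical names, and the dotted-path convention used to flatten the XML hierarchy is only one possible choice among several.

```python
import xml.etree.ElementTree as ET


class DataInstance(dict):
    """A data instance: attribute names are the keys, bound values the values."""

    @classmethod
    def from_record(cls, names, values):
        """Bind the values of a flat record to the class's attribute names."""
        if len(names) != len(values):
            raise ValueError("record length does not match the attribute list")
        return cls(zip(names, values))

    @classmethod
    def from_xml(cls, text):
        """Flatten an XML document into dotted-path attribute names.

        Assumption: only leaf element text is kept. XML attributes are
        ignored, and repeated sibling tags would overwrite one another,
        so a fuller version would need an indexing scheme for them.
        """
        instance = cls()

        def walk(element, path):
            children = list(element)
            if not children:
                instance[path] = (element.text or "").strip()
            for child in children:
                walk(child, f"{path}.{child.tag}")

        root = ET.fromstring(text)
        walk(root, root.tag)
        return instance


# A flat record maps onto the instance directly.
record = DataInstance.from_record(["id", "name"], [1, "Ann"])
print(record["name"])

# A nested document needs a naming convention for its hierarchy.
document = DataInstance.from_xml(
    "<customer><name>Ann</name><address><city>Boston</city></address></customer>"
)
print(document["customer.address.city"])
```

The flat record needs nothing beyond the inherited dictionary behavior, while the XML case has to decide how hierarchy, repetition and element attributes map onto flat attribute names, which is where the extra complexity comes from.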
Even with fast access to an attribute's value, there is a potential performance issue associated with using hash tables to represent a data instance, and it comes directly from the way hash tables are implemented. In the hash table, each attribute is represented as a key and an associated value, and both key and value are kept in the data structure. With a small number of data instances this is not a significant problem, but because each object holds its own copy of the set of attribute names, the required memory space quickly adds up. In other words, if one were building an application to review thousands or millions of data instances at the same time, the repeated storage of the attribute names in each object would introduce a serious memory requirement.
Alternatively, there are ways to optimize for memory usage at the expense of time performance. For example, one could keep a single copy of the attribute names for the whole class, mapping each name to an index into a per-instance array of values. While this reduces the space requirement, the additional indirection needed to find the array index adds to the time needed to access each element. Either way, though, using this kind of representation to abstractly manage data instances from different sources provides a seamless framework in which data sets can be analyzed, merged, compared or transformed. In future columns, we will explore how to create these kinds of data instance objects from different source frameworks.
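The memory-saving alternative described above can be sketched in Python as follows. The class names `Schema` and `CompactInstance` are hypothetical, and the actual savings depend on the language and runtime, so this is an illustration of the indirection rather than a measured optimization.

```python
class Schema:
    """One shared copy of the attribute names for a whole class of instances."""

    def __init__(self, names):
        self.names = tuple(names)
        # Maps each attribute name to its position in an instance's value array.
        self.index = {name: position for position, name in enumerate(self.names)}


class CompactInstance:
    """A data instance that stores only its values plus a reference to the schema."""

    __slots__ = ("schema", "values")

    def __init__(self, schema, values):
        if len(values) != len(schema.names):
            raise ValueError("value count does not match the schema")
        self.schema = schema
        self.values = list(values)

    def __getitem__(self, name):
        # Two steps: name -> index (shared table), then index -> value.
        return self.values[self.schema.index[name]]

    def __setitem__(self, name, value):
        self.values[self.schema.index[name]] = value

    def items(self):
        """Yield (attribute, value) pairs in schema order."""
        return zip(self.schema.names, self.values)


customer_schema = Schema(["id", "name", "city"])
ann = CompactInstance(customer_schema, [1, "Ann", "Boston"])
bob = CompactInstance(customer_schema, [2, "Bob", "Denver"])

bob["city"] = "Dallas"
print(ann["name"], bob["city"])
print(list(ann.items()))
```

Every instance still answers lookups by attribute name, but the names are stored once in the shared schema, and each access pays for the extra step from name to index.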