(Editor's note: This is a sidebar to the main story, "Is Data Science?," which you can read here. Watch Suzanne Yoakum-Stover's keynote address at the 2010 25 Top Information Manager event in New York here.)
A person familiar with data management might need an afternoon to digest the basics of ultra-large-scale (ULS) systems, a field that, it should be said, consultant and researcher Suzanne Yoakum-Stover neither invented nor works in alone.
You can Google "ULS Study Report," "The Fourth Paradigm" and "design patterns" to see where her tenets of ultra-large scale data come from. You'll find references to data sciences, but there's nothing in the way of a visible ULS community and there is no "ULS For Dummies," meaning that, after a point, you need to shut up and listen.
A ULS approach quickly gathers complexity, even for data professionals, though the problems it addresses are very familiar. The big problem with how we treat data, Yoakum-Stover says, is that we send it to places where it sticks where it lands and takes the shape of its container. "That basically means you can't modify the model or use the data outside of its model," and no single uber-model or ontology is big enough to describe ULS data.
When you draw it up, ULS Dataspace infrastructure is a technology stack with universal storage layers for 1) unstructured data, 2) structured data and 3) data/knowledge models. A fourth layer represents the means by which one can look into this Unified Dataspace according to a chosen perspective or view of the data.
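As a thumbnail, that layering can be sketched in a few lines of code. Everything below is a hypothetical illustration of the article's description, not an actual Mission Focus interface: the layers are plain dictionaries, and a "perspective" is just a projection over model-free records.

```python
# Toy sketch of the four-layer ULS Dataspace stack described above.
# All names and structures here are illustrative guesses, not real APIs.

unstructured_layer = {}   # layer 1: raw artifacts, keyed by id
structured_layer = {}     # layer 2: structured records, keyed by id
model_layer = {}          # layer 3: data/knowledge models (field lists)

def view(model_name: str, record_id: str) -> dict:
    """Layer 4: look into the dataspace through a chosen model's lens."""
    model = model_layer[model_name]          # the perspective to apply
    record = structured_layer[record_id]     # the data, stored model-free
    # Project only the fields this perspective knows about.
    return {field: record[field] for field in model if field in record}

# One record, two perspectives onto it:
structured_layer["r1"] = {"name": "Ada", "unit": "5th", "grid": "18S UJ 228"}
model_layer["roster"] = ["name", "unit"]
model_layer["map"] = ["name", "grid"]
```

Here the same record serves both a roster view and a map view without ever being reshaped, which is the point of layering perspectives over storage rather than baking a model into the container.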
Yoakum-Stover had a head start on the ULS data problem with work already done for the Army, which unified unstructured data within a simple document-style data store. By representing each artifact with a tiny bit of metadata and a bucket for the artifact itself, she achieved a kind of universal store for unstructured data.
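In outline, such a store can be very small. The sketch below assumes a content-hash id and a couple of metadata fields; the actual Army system's schema is not described in the article, so the field names are invented for illustration.

```python
import hashlib

# Minimal sketch of a metadata-plus-bucket store for unstructured data.
# Field names are illustrative; the article does not give the real schema.
store = {}

def ingest(content: bytes, mime_type: str, source: str) -> str:
    """File any artifact, whatever its format, without reshaping it."""
    artifact_id = hashlib.sha256(content).hexdigest()
    store[artifact_id] = {
        "metadata": {              # the "tiny bit of metadata"
            "mime_type": mime_type,
            "source": source,
            "size": len(content),
        },
        "bucket": content,         # the artifact itself, byte-for-byte
    }
    return artifact_id
```

Because the bucket is opaque, a memo, an image and a sensor dump all land in the same store the same way; only the metadata varies.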
The next challenge would be to build a universal store for structured data on top of this. That is tough to create, because we're trained that the "right" way is to clean and shape data into one big neat schema. But one schema can never be made big enough to represent an entire enterprise, much less the span of ULS data. Databases are always created for a particular purpose, and we integrate them by extracting, transforming and loading data from elsewhere. We move and transform data endlessly, losing information at every step along the way, only to lock it into another set of containers.
There is waste and fatalism in this approach. "Enterprises are impoverishing their data, scraping down the richness to get their hands around it," Yoakum-Stover says. The key to overcoming repetitive data integration is to unify the data without integrating it.
However it is structured, all structured data has things in common, such as entities and associations with other data. What Yoakum-Stover needed was a minimal set of universal "things" for describing structured data that could be used to display or capture data from any model.
Her solution is called the Data Description Framework, the greater workings of which require too much space to detail here. DDF might be read as a spin on the existing Resource Description Framework (RDF), but it is not a model or a model of models. DDF is an abstraction over models that takes data and semantics ingested from any number of silos and implements them within the storage model of your choosing: a graph, a relational database, a list of objects or a hierarchical tree. Instead of forcing a schema with descriptions such as "owner" or "address," it uses five abstract elements to describe mentions of data: sign, concept, predicate, term and statement.
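The article names the five elements but not how they fit together, so what follows is purely a speculative toy reading, with every structure invented for illustration. It is meant only to convey the flavor of describing data without forcing a schema onto it.

```python
# Speculative toy reading of DDF's five elements; the real framework is
# not public in this detail, and every structure below is invented.

signs = {"s1": "ACME Corp", "s2": "New York"}     # symbols as they appear
concepts = {"c1": "organization", "c2": "place"}  # what kinds of thing
terms = {"t1": ("s1", "c1"), "t2": ("s2", "c2")}  # a sign bound to a concept
predicates = {"p1": "located in"}                 # relations between terms
statements = [("t1", "p1", "t2")]                 # mentions of data

def read(statement):
    """Render one statement back into words."""
    subj, pred, obj = statement

    def label(term_id):
        sign_id, concept_id = terms[term_id]
        return f"{signs[sign_id]} ({concepts[concept_id]})"

    return f"{label(subj)} {predicates[pred]} {label(obj)}"
```

Note that nothing here says an organization has an address; "located in" is itself data, which is the sort of move that lets a store hold statements drawn from any model.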
DDF seeks to minimize the transformation of data; instead, it decouples data from models and models from storage schemas. It takes the data and "throws it on the Unified Dataspace Floor," where it can be searched, explored, enriched and drilled into regardless of silo boundaries. It avoids integrating or harmonizing data models up front. The native data and semantics become one enormous unified framework, complete and undistorted. Upon this framework, the business of data integration can be pursued in new ways. In Yoakum-Stover's estimation, a whole new species of applications can be built on unified data without ever engaging in the onerous task of data model harmonization.
The top half of this stack is still being written by Yoakum-Stover and fellow researcher Andy Eick, who work with the Department of Defense via their consultancy, Mission Focus. There are no insurmountable barriers in sight, she says, but the work will surface a lot of "DARPA-hard" problems (a reference to the Defense Advanced Research Projects Agency, the Defense Department's arm for pursuing extremely difficult research challenges) that her research did not set out to address.
"People who see it will tend to criticize it for all of the hard questions it raises, for example, how to support real-time systems or keep track of versions when ingesting data from dynamically changing sources. Nothing I am doing right now directly addresses that."
Standard technologies and approaches will grow around ULS and DDF, but she's sure the time is also right for a rigorous theoretical scientific approach that contemplates the sphere of data in its totality.
"Because you're bringing so much data together in this cauldron, the data management problems become huge. You've just taken away the artificial barriers that segmented the mess and when you take away that accidental complexity, now all the problems are really revealed."