Is Data Science?
Alexandria, Virginia, was settled on the west bank of the Potomac River in 1695. It became a slave-trading port and, later, an early dividing line of the Union and Confederacy. Today it's an affluent Washington, D.C. suburb on the southern perimeter of the capital beltway, sixty-four miles of highway circling the bull's-eye of power and influence in U.S. government.
Nighttime traffic between Maryland and Virginia is fast to reckless, but daytime gridlock reveals an expanse of high rises stretching out to Alexandria's West End.
Just out of view from I-95, Suzanne Yoakum-Stover and her research partner Andy Eick are up early in an apartment suite turned office, boiling water for tea and writing software code. They are partners in a tiny consultancy called Mission Focus that builds systems for the military intelligence community in the Department of Defense. They also run a nonprofit called the Institute for Modern Intelligence. Both companies were born for fighting terrorism and protecting American lives, though IMI has broader plans.
Yoakum-Stover's passion is ultra-large scale data. If that makes you think of the buzz you're hearing about "big data," stop, and think much bigger. Her vision of ultra-large scale is one enormous, diverse "Unified Dataspace" on an order of size equal to the entire Internet and then some, with a whole ecosystem of processing to study it (see sidebar: "What is Ultra Large Scale Infrastructure?").
She's already built working foundations of a Dataspace system in an intelligence project for the U.S. Army. Despite one numbing setback, the rest is in the works, where it is revealing hard new challenges, but no fatal flaws. Should it become reality and work as described, it likely represents a new phenomenon in data diversity and sheer scale.
High-tech anti-terrorism grabs your attention, but to understand the story, you need to set aside technology for just a moment to see the mindset that makes this person unlike any CIO or startup entrepreneur you've met. She aspires to be neither, because she sees ULS as a pursuit, not a product.
Stover wades into technical minutiae as quickly as you ask, but there's none of the eccentricity you'd expect of a person this educated and outspoken. Disarmingly, she's quickly Suzi to friends, and lights up when explaining ideas, especially ULS. You can hire her brainpower, but it's the sum of her background and her physics Ph.D. that makes her talk like a scientist-advocate about the topic of data.
The analogy she is explaining to me in her office is the transition from classical to modern physics and large-scale institutional research that all started a century ago.
"Military intel these days is still very archaic because our work is isolated and narrow," she begins, "and I have this gut feeling we're on the brink of a transition like we saw in physics that will happen for intelligence."
She points at a bookshelf and explains the difference. "We have these wonderful disciplines for physics, biology and less pure things like economics, which may be the closest thing to where we're going with information," she says. "Have you seen how fat those graduate textbooks on economic theory are? We don't have any books like that in the intelligence domain, and we need a public institution to study it."
That is her dream for IMI, something like a Fermi Lab or CERN, where ULS Dataspace infrastructure would draw engineers, academics and scientists to "test their biggest, baddest" algorithms and visualizations. ULS would provide "one enormous cauldron seething and bubbling with applications, algorithms" and an absolutely unheard of scale of data.
But first, it all has to be built and it all has to work. And you can't jump to the end of the story. You need to unravel her background, a Long Island child of the '60s who took her physics doctorate from Stony Brook University to a post-doc associate professorship at the University of Wyoming. She likes wide-open spaces, and Laramie filled the bill until experimental computation in atomic modeling turned into an unfunded dead end.
Stover migrated to the canyons of Manhattan and an artificial intelligence startup using natural language processing that went bankrupt in the dotcom boom. "Building artificial intelligence is a terrible business model," she says. "It's no business model."
She next tried database matching. "What could be more boring than that?" except it used AI and Bayesian statistics and estimation, a learning experience but eventually an intellectual roadblock. She learned graphics and visualization at another startup that went belly up. Discouraged, she nearly became the best high school physics teacher anyone would ever be fortunate enough to learn from.
But a semester into her master's in education, her AI friends reconnected and pulled her into Object Sciences, a small company with some government contract work. Natural language processing and database matching were a solid foot in the door and gave her new options to contemplate. After SAIC bought Object Sciences, she moved to the nonprofit Potomac Institute for Policy Studies, where her science background could shine through in policy development for promising technologies.
When I met her in 2009, Suzi Stover was a civilian consultant building ULS infrastructure for the Distributed Common Ground System-Army. DCGS-A is the toolkit fielded by the Army to help everyone from soldiers to think tank analysts battle terrorism.
Ideally, intel units would like a system that can dive into all kinds of data at once, one that doesn't take months to create - or recreate for another use, as she explains.
"If I want to track a bad guy, I would look at all the things that can be tracked. I would track his credit cards, I would track when he swipes a badge at a parking lot, I'd track his EZ-Pass and I'd want some analytics that transform the relevant parts of the data into dots of locations and times."
That's one simple use case for intel data. Compared to a domain like business intelligence, military intelligence is murderously complex, with extremely diversified needs that call for one data infrastructure to support them all. This is what the Dataspace proposes to be.
DCGS-A is where she met her future research partner Eick, a globetrotting visualization expert brought in to apply geospatial mapping technology to the Army project. Almost immediately, he bought into the Dataspace's potential for tapping into "one big soup of stuff" for intel, a theoretical capability to track any amount of data in one big horizontal swipe without the need for a unique project model or schema. It could be selective, like an observatory pointed at a galaxy in a constellation.
The idea grew while Stover worked as technical lead in a project to build a secure data vault to let war fighters, from the lowest to highest security clearance, authenticate their credentials and see all the data they are entitled to and no more. More important, it would filter across many domains or types of information.
From these bread crumbs, Suzi Stover spent the next year on an idea to crack open the space between all the disconnected silos and models we create to describe data. Her idea for a new Data Description Framework and ULS would tear down the walls, decouple data from the models and storage that confine it, and turn all that into an abstraction, a hands-off way to dip into all of the data, semantics, and interfaces we have without distorting or losing bits of it in the process.
In doing so, it would expand the amount of data by a factor of at least four or five and the computing and storage needed to manage it. "So there I was, preaching these ideas to the Army and they basically paid my salary as I looked for a place to apply this," she recalls. "There were some pilot projects, and then all of a sudden, cloud computing shows up."
A lucky coincidence had arrived with massively parallel processing and inexpensive infrastructure that could be built or rented to allow enormous computing and storage to be summoned at will. Because of the cloud, ULS wouldn't have to excavate anything comparable to the 17-mile tunnel beneath Switzerland used to smash atoms at CERN.
The top brass took a meeting and liked Stover's ideas. "There was a lot of excitement and I briefed the Army's G-2, the top intelligence officer, a brilliant three-star general who understood our data problem. I had 15 minutes, and suddenly our tiny low-profile project was understood."
For the next nine months, 10 developers built the core of a Dataspace system using cloud computing and storage to "ingest" structured and unstructured data of any format into a unified data store. The cauldron was boiling, and work was ramping up on the top layers for unified storage of data models and the disparate interfaces, the "lenses" through which users could "see" the data.
Then suddenly, in February, the Army "decided to go in a different direction." Our conversation goes silent for a moment before she resumes. "It was beautiful work on a beautiful system but the government civilians decided to take a more conservative, if not legacy, standard of ETL, system integration and data integration."
Pressed on the point, she defends the work done in a flash of resolve, the only fleeting exception to her usual didactic cheeriness.
"Look, there is a tremendous pressure to go with your history, your team and what they understand. A lot of the swirl at DCGS-A was the result of contortions over the language being used, and I tried to prevent that from happening. We had a very rigorous and well-defined language that was co-opted for another agenda."
The subtext would be clear to anyone around the business of government. In the defense industry, where big contractors carve out massive multiyear budgets of appropriations from Congress, there is pressure to "refresh" spending cycles and update familiar infrastructure.
From another view, the machinery is simply so big that it exists mainly to perpetuate itself. Old Town Alexandria is a tourist destination known for cobblestones and quaint brick courtyards, but the residential mix is dominated by professional and technical services and vested administration jobs in government and the military. It sits just downriver from the Pentagon, and like the rest of the beltway, it's peppered with lobbyists and influence that keep an eye on the status quo.
A small consultancy or nonprofit will likely come into this setting under the wing of a larger established contractor that will use it for its own purposes, big idea or not. That's not exactly what happened to Yoakum-Stover and Eick, and they are quick to praise the talent and support of senior military leadership. There are pockets within agencies right now, they say, where people see the vision and are eager to make it happen.
In any event, the wheels had turned. She stayed on as a consultant until July before disengaging entirely, but that was not the end of the story for Stover and Eick. They used the interruption to press their own "refresh" button and took the work back in their own hands. Both are talented software engineers who love writing code and seem to relish the tea-spiked 7 a.m. sessions that run Monday through Thursday for 12 hours or more.
"We went back to brass tacks, redesigned the data architecture and the whole system. Software developers will tend to make decisions for expediency and reuse what they already know. Conceptually, the fundamental ideas are the same but the implementation changed profoundly because everything changes when you start writing code. You see things you don't see otherwise."
On Fridays, she stops midday and drives back to her family and their farm in southern Pennsylvania. She still likes wide-open spaces, but gets lost writing test code deep into the evening.
Eick feels liberated to be back on the design side instead of managing feedback for a team of others. "When you get closer to the code and recognize the tradeoffs other people chose, you can decide not to make those tradeoffs."
Now, there are signs the intel communities' interest in Stover's ULS Dataspace is rebounding, and Mission Focus is hiring. It feels like things are moving in the right direction.
If it gets far enough, it could be the launching pad for her vision of IMI as an institute that will further the work she cherishes, if and only if it can be at arm's length from any corporation, government or overriding dogma.
"If you're going to build a solution that cannot be viable in a freely evolving interdependent collective of people, politics, systems, cultures, very little of which will ever be under your control, then that solution won't take you far. Think of what it actually means to support the kind of scale and diversity we're talking about."
Listening to her own words, she pulls back so as to not sound too self-engrossed, though she is always offering thoughts on a high plane. We chat about models for institutions and endowments and laugh conspiratorially at the irony it would be if the Bill and Melinda Gates Foundation were to found or fund such a place.
Though she's known in small circles, Suzanne Yoakum-Stover is a lightning rod for nonconventional viewpoints she's prepared to back up. She'll admit that some traditionalists think she's gone off the deep end and that another fringe is more likely to shave their heads and follow her around.
She's not here for acolytes, just to challenge what we think we know for the sake of her training in ways that could save American lives, and maybe a small vanity to wonder if, a century from now, people will point back to her work as a tangible advance. She's sure it's going to happen, whether she sees it all or not.
"Head in the clouds, feet in the dirt, that's our motto. I don't want to make widgets, I want to walk through a door we'll look back at and see it was this new thing we couldn't understand before."
Editor's note: This article includes a sidebar, "What Is Ultra Large Scale Infrastructure?," which you can read here.
Watch Suzanne Yoakum-Stover's keynote address at the 2010 25 Top Information Manager event in New York.