As organizations look to big data to spot trends – and opportunities – they’re finding it can take days, weeks or even months to extract and assemble data that, once compiled, is already out of date.  This is especially true when merging existing data with new resources, such as location intelligence within geographic information systems.

A GIS collects, analyzes, and delivers images of geographic-based information. An energy analyst might use a GIS to spot oil and gas exploration sites, a homeland security analyst might use the technology to spot potential terrorist targets, and a medical research center might tap a GIS to spot health-problem hotspots.

Duke Medicine, which incorporates Duke University School of Medicine, the Duke University School of Nursing and the Duke University Health System, wanted to go a step further.

Duke collects mountains of data. It runs three hospitals and hundreds of clinics, looking after some 2 million patients, all with unique electronic health records. The health system thought that if it could marry its EHR data with a GIS it could give its clinicians the ability to pull information on certain conditions, match that to geographic locations, and predict – on demand – which people within a population are likely in the future to be diagnosed with a particular ailment.

To get this big data project off the ground, Duke hired Sohayla Pruitt two years ago as its Senior Geospatial Scientist. Pruitt, who has a GIS master’s degree, came to the health system after a stint at NASA’s Goddard Space Flight Center and at a Department of Homeland Security-funded startup called Spadac. Her experience taught her the importance of breaking out of the old project-by-project model of gathering data and building static maps. Her idea: have prebuilt data and modeling at the ready and allow researchers to make a quick analysis of any given problem at hand.

“I thought, wow, if we could automate some of this, preselect some of the data, preprocess a lot and then sort of wait for an event to happen, we could pass it through our models, let them plow through thousands of geospatial variables and [let the system] tell us the actual statistical significance,” Pruitt says. “Then, once you know how geography is influencing events and what they have in common, you can project that to other places where you should be paying attention because they have similar probability.”

What Duke and Pruitt built is a big data system that does just that. It allows Duke researchers and clinicians to select, visualize and predictively study any group of patients with any healthcare issue in real time. The front end of the system is DEDUCE, a home-grown business intelligence tool built on the .NET framework. It runs against an enterprise data warehouse fed by Oracle and Greenplum databases. Esri’s ArcGIS Server platform and JavaScript allow for geospatial visualization.

“When we visually map a population and a health issue, we want to give an understanding about why something is happening in a neighborhood,” says Pruitt. “Are there certain socioeconomic factors that are contributing? Do they not have access to certain things? Do they have too much access to certain things like fast food restaurants?”

‘Big Geo-Data’ Automation

Despite the expectations set by web-based maps, the process at Duke is nothing so easy as tapping a name or address into a big search engine. “We can’t just serve up data and overlay streets and demographics for analysis,” Pruitt says. Plus, healthcare information is sensitive and all the work must happen in Duke’s enterprise data warehouse behind a secure firewall.

This is where automation comes into play with what Pruitt calls “big geo-data” elements, layers of presorted geographical data, processing and standardization that “premodel” and “precorrelate” health care scenarios for researchers.

Duke starts with an automated geocoding system that amends and verifies every address entering its healthcare system against a USPS database for accurate spelling, abbreviations and nine-digit ZIP codes. The standardized addresses are next passed through a commercial mapping database and geocoded to a street- or rooftop-level of accuracy (any lesser accuracy is discarded).
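The discard step described above can be pictured with a minimal Python sketch. It is an illustration only, not Duke's code: the GeocodeResult type, the accuracy labels and the sample addresses are all hypothetical, and a real commercial geocoder would report match quality with its own codes.

```python
from dataclasses import dataclass

# Hypothetical accuracy labels; a real geocoder uses its own codes.
ACCEPTED_ACCURACY = {"rooftop", "street"}


@dataclass
class GeocodeResult:
    address: str   # standardized address (invented examples below)
    lat: float
    lon: float
    accuracy: str  # e.g. "rooftop", "street", "zip_centroid"


def keep_precise(results):
    """Keep only results geocoded to street- or rooftop-level accuracy."""
    return [r for r in results if r.accuracy in ACCEPTED_ACCURACY]


# Invented sample data for illustration.
results = [
    GeocodeResult("100 EXAMPLE ST", 36.00, -78.90, "rooftop"),
    GeocodeResult("200 SAMPLE AVE", 36.01, -78.91, "zip_centroid"),
    GeocodeResult("300 TEST RD", 36.02, -78.92, "street"),
]
precise = keep_precise(results)
```

Filtering at this stage means every later overlay works only with points precise enough to place inside a small census area.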

Once each residence has a latitude and longitude, Duke overlays the geographic boundaries of U.S. census block groups, census tracts and metro statistical areas to assign each address the corresponding IDs. With a block group ID, Duke can then import all the U.S. census data (including median income, average commute time and transit options, the percentage of people with high school and college degrees and the percentage on public assistance) that is reported for every block group in the U.S.
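As a rough sketch of that overlay, the Python below finds which boundary polygon contains a point and then looks up attributes by the resulting ID. Everything in it is assumed for illustration: the block group IDs, the square boundaries and the census values are invented, and a production GIS would use spatial indexes and real boundary files rather than a linear scan.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical block group boundaries (simple squares) and attributes.
BLOCK_GROUPS = {
    "BG-A": [(-79.0, 36.0), (-78.9, 36.0), (-78.9, 36.1), (-79.0, 36.1)],
    "BG-B": [(-78.9, 36.0), (-78.8, 36.0), (-78.8, 36.1), (-78.9, 36.1)],
}
CENSUS = {
    "BG-A": {"median_income": 42000, "pct_college": 0.31},
    "BG-B": {"median_income": 67000, "pct_college": 0.52},
}


def attach_census(lon, lat):
    """Return the block group ID and its census attributes for a point."""
    for bg_id, polygon in BLOCK_GROUPS.items():
        if point_in_polygon(lon, lat, polygon):
            return {"block_group": bg_id, **CENSUS[bg_id]}
    return None  # point falls outside every known boundary
```

Calling attach_census(-78.95, 36.05) returns the BG-A record, while a point outside both squares returns None.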

What Duke gets is an environmental comparison of socioeconomic indicators at a block level (usually 100 to 500 people) to an individual or group of patients. More information arrives from the Census Bureau’s ongoing American Community Survey, which aggregates and applies more layers of information – such as ages, disabilities, health insurance – to every block group. Between postal and census data, Pruitt says, Duke can physically plot about 10,000 data elements, none of which would have come from a doctor’s appointment.

All that data is available now to Duke researchers, who can select a cohort (grouping of patients, e.g., diabetics), filter it and analyze it in regression models that determine which socioeconomic and other variables are actually relevant to their group.

“It’s a way of taking the bias out of traditional analysis where a researcher says, 'Let’s see if median income or public assistance plays a role,' and you only have the benefit of a handful of variables,” Pruitt says. “Instead of that, you let your statistical models tell you what’s going on and where the true correlations exist.” 
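The idea of letting the statistics choose the variables can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the regression models Duke uses: it ranks every candidate variable by the strength of its Pearson correlation with a yes/no outcome. The variable names and all numbers are invented, and a real analysis would also need significance tests and a correction for screening many variables at once.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / (sd_x * sd_y)


def rank_variables(variables, outcome):
    """Rank variables by absolute correlation with the outcome."""
    scores = {name: pearson(vals, outcome) for name, vals in variables.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Invented toy cohort: 1 = has the condition, 0 = does not.
outcome = [1, 1, 1, 0, 0, 0]
variables = {
    "avg_commute": [20, 25, 22, 21, 24, 23],
    "median_income": [30, 35, 60, 40, 65, 70],
    "fast_food_density": [9, 8, 7, 3, 2, 1],
}
ranking = rank_variables(variables, outcome)
```

With this toy data, fast_food_density ranks first, median_income second and avg_commute last. The point is that the ordering comes from the data rather than from a researcher's initial hunch.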

Big Data Results

Pruitt sees the system’s benefits initially extending to researchers and clinicians. “Visualize it like this: A patient comes in with certain symptoms, you run probability models on those symptoms, but then you also have the geographic influences that let you visualize an outbreak of flu in a neighborhood,” she says. “This is valuable information you could certainly serve back to the doctor and the community.” By extension, a third constituency is the patients themselves, who might better see how environmental conditions affect diabetes or heart disease.

Duke is now working on a proof of concept, developing algorithms for map locations and patient proximities. In the setup, smoking behavior has been correlated with different geographic signatures to predict where people are likely to smoke, which would be helpful in real-time interventions.

Pruitt also is creating on-demand models that scale to the full population of electronic health records and could be used for any number of purposes, from tracking outbreaks of food-borne illness to routing patients to the most cost-effective facilities for hip replacements.

“It’s easy to visualize or just say, ‘Oh, this person lives in a low income neighborhood with lots of fast food restaurants.’ You could probably do that very quickly,” she says. “But the only way to really understand the statistical significance of what’s going on and where else it’s happening or going to happen is through infrastructure development, by pre-downloading that data, prepping and pre-relating that data to every address and every EHR.”