The U.S. Food and Drug Administration has embarked on a project that the agency considers critical for its future: the modernization of its IT platforms to get ready to handle the enormous data sets the FDA receives each day from healthcare providers, product manufacturers, regulatory groups and other organizations.
The growth of these datasets is due to the integration of new submissions such as marketing applications—often containing clinical trial and animal study data—and adverse event reports into existing databases, as well as the rising numbers of submissions each year, says Taha Kass-Hout, chief health informatics officer at the FDA, who’s responsible for overseeing the agency’s big data and analytics initiatives.
“In addition, the relatively new scientific ability to write all the data about an entire genome is starting to lead to marketing applications that include genomic data and laboratory investigations that include identifying the genome for pathogens involved in infectious outbreaks,” Kass-Hout says.
New treatments and vaccines also are expected for Ebola, although “genomic data probably won’t be included in the near future, so the expected sizes of the various datasets involved in Ebola-related submissions probably won’t increase dramatically,” Kass-Hout says.
The FDA is responsible for protecting the public health by ensuring the safety, efficacy and security of human and veterinary drugs, biological products, medical devices, the nation’s food supply, cosmetics, and products that emit radiation. It’s also responsible for advancing the public health by helping to speed innovations that make medicines more effective, safer, and more affordable and by getting to the public the information needed to use medicines and foods to maintain and improve health.
This year alone, the FDA expects to receive somewhere between 1.5 million and 2 million submissions through its eSubmission Gateway—an agency-wide solution for accepting electronic regulatory information for review—and some drug and genomic submissions can now be as large as 1 terabyte in size.
“This is the very definition of a big data problem, and traditionally it has been the Achilles’ heel for efficient review and compliance activities,” says Kass-Hout. “The data coming in is larger, more frequent and with huge variations in format, standards and quality.”
Data varies from structured online forms, industry drug and medical device submissions, labeling updates and genome sequences, to highly unstructured data such as handwritten reports, videos, pictures, narratives and clinical trials submissions.
“The data integration and data fusion capabilities of big data solutions can help the FDA integrate, manage, secure and analyze all of the agency’s information,” Kass-Hout says. “With big data-driven analytics, FDA can implement data mining, predictive analytics, text mining, forecasting and optimization to explore options and make the best possible regulatory decisions.”
However, he says, “Due to its enormous complexity, big data requires new approaches for processing and storage.”
To meet the challenges of big data—and reap the opportunities—the FDA is creating an IT environment designed to handle extremely large amounts of data and provide tools to identify and extract valuable information that it can analyze.
The FDA’s new big data architecture includes four main steps: ingest, transform, process and analyze.
Ingest involves the intake of structured and unstructured information from within and outside the organization. Sources include submission data, next-generation DNA sequencing, and even social media outlets.
Depending on the source, the incoming data is filtered through various systems that convert the database and/or data format, perform some data cleaning and curation, and add application programming interfaces, Kass-Hout says. “Then, it is transformed into complex, highly- and multi-structured big data and stores. Via distributed computing using the cloud, the information can be processed, queried and analyzed throughout the agency.”
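The four steps can be pictured as a chain of small functions. The following Python sketch is illustrative only: the `Record` type, the function bodies and the sample sources are assumptions made for this example, not a description of the FDA's actual systems, and the "process" step here is a simple in-memory grouping standing in for distributed computing.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical record type: one item plus the steps it has passed through."""
    source: str
    payload: dict
    history: list = field(default_factory=list)


def ingest(raw_items):
    """Ingest: wrap (source, data) pairs from any origin as records."""
    return [Record(src, dict(data), ["ingest"]) for src, data in raw_items]


def transform(records):
    """Transform: normalize field names and drop empty values (toy cleaning)."""
    out = []
    for r in records:
        cleaned = {
            key.strip().lower(): value
            for key, value in r.payload.items()
            if value not in (None, "")
        }
        out.append(Record(r.source, cleaned, r.history + ["transform"]))
    return out


def process(records):
    """Process: group records by source (stand-in for distributed processing)."""
    grouped = {}
    for r in records:
        staged = Record(r.source, r.payload, r.history + ["process"])
        grouped.setdefault(r.source, []).append(staged)
    return grouped


def analyze(grouped):
    """Analyze: count records per source as a trivial example of analysis."""
    return {source: len(items) for source, items in grouped.items()}


def run_pipeline(raw_items):
    """Run all four steps in order: ingest, transform, process, analyze."""
    return analyze(process(transform(ingest(raw_items))))
```

Keeping each step as a separate function means any one of them could be swapped for a real system without changing the others, which is the main point of the sketch.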
Among the big data technologies the FDA has deployed within its IT infrastructure are interactive, real-time and advanced analytics tools from vendors including Google, HP, Intel, ParStream, Pentaho, SAS, Splunk, Tableau, Teradata and Tibco.
For real-time data processing, the agency uses systems from the Apache Software Foundation, Cloudera and others, and it uses Hadoop for batch processing.
Much of the backbone for its big data initiatives is in a hybrid cloud model that includes existing data centers or high performance computing environments available at FDA.
The cloud portion includes public platform-as-a-service (PaaS) offerings such as Microsoft Azure, EMC and Google, and infrastructure-as-a-service (IaaS) offerings from Amazon Web Services, HP and IBM. The on-premises systems include distributed computing platforms from HP and IBM. Big data storage is handled by systems from a multitude of vendors.
Big Data Benefits
The FDA is applying big data and analytics in a number of areas of its operations, and has seen benefits. For example, it increased its ability to store, manage and analyze genetic sequences to accelerate and more accurately conduct epidemiologic investigations. The agency can also better understand pathogenic virulence, identify drug resistance mutations, and manage huge volumes of scientific, clinical and regulatory information that support its regulatory reviews and compliance investigations associated with unlawful uses of drugs and medical devices.
It also supports “our ability to capture and manage FDA regional laboratory data collected in response to food borne illness and pathogen identification and conservation activities,” Kass-Hout says.
One of the most important initiatives supported by big data and analytics is openFDA, a research and development project that Kass-Hout launched after his arrival from the Centers for Disease Control and Prevention in March 2013.
The openFDA initiative features a user community for sharing open source code, examples and ideas. It began with the purpose of offering developers and researchers easy access to high-value FDA public data, “high value being data that are most frequently requested by the public,” he says. This includes adverse event reports for drugs; information on thousands of drug, device and food recalls; data about device malfunctions; and labeling for more than 65,000 products such as prescription and over-the-counter drugs.
The first data loaded was the drug adverse event reports database. Users are free to work with whatever data they choose, he says, and many have already.
“The main goal has been to make it simple for an application, mobile or web developer to easily use data from the FDA in their work,” Kass-Hout says.
OpenFDA employs best practices in data science, cloud computing, community management and content management to create a flexible platform designed to serve the agency and spur citizens and the private sector to work on innovative projects using FDA big datasets.
After eight months of evaluation and development, the agency successfully launched the open.FDA.gov web site with the first API in June 2014. The API combines several FDA data sets totaling hundreds of gigabytes in size with various formats (structured, unstructured, and semi-structured) to create a useful and informative resource, Kass-Hout says. “The platform will allow growth to any size as the datasets grow and additional datasets are added,” he says. “The platform has sustained over 100 requests per second per node in the cloud, and can be scaled to much more.”
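As a rough illustration of what the per-node figure implies for scaling, the Python sketch below estimates how many nodes a given request rate would need. It assumes throughput scales linearly with node count, which the article does not state, and the `headroom` parameter is a hypothetical safety margin added for this example.

```python
import math


def nodes_needed(target_rps, per_node_rps=100.0, headroom=0.0):
    """Estimate node count for a target request rate.

    per_node_rps defaults to the 100 requests per second per node quoted
    in the article. headroom is an assumed fraction of each node's
    capacity held in reserve (0.2 means only 80% is used).
    Linear scaling across nodes is an assumption of this sketch.
    """
    if target_rps < 0 or per_node_rps <= 0 or not 0 <= headroom < 1:
        raise ValueError("invalid capacity parameters")
    usable_rps = per_node_rps * (1 - headroom)
    # Always keep at least one node, even for zero traffic.
    return max(1, math.ceil(target_rps / usable_rps))
```

For example, `nodes_needed(250)` returns 3, since two nodes would cover only 200 requests per second.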
Since the release, the FDA has developed three more APIs, for recalls, prescription and over-the-counter drug and biologics labeling, and medical device reports. There have been more than 55,000 visitors from more than 20,000 connected Internet devices, Kass-Hout says.
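For developers, working with these APIs comes down to composing an HTTP query. The Python sketch below only builds a query URL and makes no network call. The base URL, endpoint paths, and the `search` and `limit` parameters are taken from openFDA's public documentation as best recalled, not from this article, so treat them as assumptions to verify before use. The `ENDPOINTS` keys and the `build_query_url` helper are names invented for this example.

```python
from urllib.parse import urlencode

# Assumed from openFDA's public documentation; verify before relying on them.
BASE_URL = "https://api.fda.gov"
ENDPOINTS = {
    "drug_events": "/drug/event.json",       # drug adverse event reports
    "drug_recalls": "/drug/enforcement.json",  # drug recall reports
    "drug_labels": "/drug/label.json",       # drug product labeling
    "device_events": "/device/event.json",   # medical device reports
}


def build_query_url(dataset, search=None, limit=1):
    """Build a query URL for one of the datasets listed in ENDPOINTS.

    dataset is a key of ENDPOINTS (a naming scheme local to this sketch).
    search is an optional field:term expression; limit caps the result count.
    """
    if dataset not in ENDPOINTS:
        raise KeyError(f"unknown dataset: {dataset}")
    params = {}
    if search:
        params["search"] = search
    params["limit"] = limit
    # urlencode percent-encodes reserved characters such as ':'.
    return f"{BASE_URL}{ENDPOINTS[dataset]}?{urlencode(params)}"
```

The resulting string can be passed to any HTTP client, such as `urllib.request.urlopen`, with the response parsed as JSON.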
In addition, openFDA has served as a test bed for the FDA to try various technologies.
“In particular, the effort has been to learn about how best to develop and release data, and how to engage a community of users around that data,” Kass-Hout says. “In just 12 months, using an agile software development and user-centered usability approach, we believe openFDA has successfully delivered on all fronts, while providing a foundation for a great deal more to learn and do.”
As an R&D project, openFDA “is intended to serve as a big data testing and development platform for the agency, to engage broadly with various agency partners, and to disseminate our approach and learning widely within the agency and across the public and private sectors,” he adds.
Still, the FDA faces three primary challenges with regard to leveraging big data: transfer rates of very big datasets, data standards and interoperability, and data management and provenance (or origin).
“Managing terminologies, code sets and mapping between code sets is a very complicated issue that is a frequent drag on our efficiency,” Kass-Hout says. “For extremely large datasets, I/O [input/output] is a tremendous problem without a fantastic market solution. Data management and provenance is also a big issue for us, as it is for most data-driven organizations.”
Meanwhile, the flood of data will not abate, Kass-Hout says. “FDA centers and offices require ever-expanding data sets from our own intramural research, academic research, the NIH [National Institutes of Health], social media, the EMA [European Medicines Agency], and reference data stores from all over the world,” he says.
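To make the code-set and provenance problem concrete, here is a minimal Python sketch of mapping one code set to another while recording where each mapping came from. Everything in it is hypothetical: the codes, the `CROSSWALK` table, the crosswalk name and the `MappedCode` type are invented for illustration and do not correspond to any real terminology or FDA system.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical crosswalk between two made-up code sets.
CROSSWALK = {"L-001": "S-100", "L-002": "S-200"}


@dataclass(frozen=True)
class MappedCode:
    """Result of a mapping attempt, with a provenance note attached."""
    source_code: str
    target_code: Optional[str]
    provenance: str


def map_code(source_code, crosswalk=CROSSWALK, crosswalk_name="local-to-standard-v1"):
    """Map a source code to the target code set and record how it was mapped.

    Unmapped codes are returned with target_code=None rather than dropped,
    so gaps in the crosswalk stay visible downstream.
    """
    normalized = source_code.strip().upper()
    target = crosswalk.get(normalized)
    status = "mapped" if target is not None else "unmapped"
    return MappedCode(normalized, target, f"{crosswalk_name}:{status}")
```

Carrying the crosswalk name and status with every result is one simple way to keep provenance attached to the data instead of reconstructing it later.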
Effective big data management “gives us the ability to quickly identify the desired information without having to rely on massive infrastructure and a single structured data platform,” Kass-Hout says. This gives the agency the ability to mine, model, simulate, hypothesize and predict much faster than ever before, which can directly be correlated to better intervention, safety alerts and preventing dangerous products from reaching the public, he says.
“Everyone talks about ‘the Internet of everything,’” Kass-Hout says. “Well, imagine the possibility of ingesting electronic health record data, adverse event data, regulatory inspections, clinical trial data and research data into a single environment with incredible new search and cognitive computing technologies. It vastly changes the questions we can ask, and for the FDA this is transformative.”