NOV 1, 2004 1:00am ET

Optimal Data Architecture for Clinical Data Warehouses

While electronic data capture brought modern computer front ends to clinical trials, back-end processing of clinical data remains relatively medieval. Trial designs for new products evolve rapidly, forcing downstream analysis teams to stovepipe each uniquely structured source data set into separate schemas and hand-adapt the extract, transform and load (ETL) code. Because this tactical approach is cumbersome, time-consuming and non-repeatable, it does not scale well, and many clinical data warehouse initiatives by even the biggest pharma/biomed companies implode under the pressure of mushrooming costs and elusive ROI.

The solution to warehousing the ever-changing structures of clinical data will require 1) a single, multi-trial repository based upon abstract rather than directly representational data models, and 2) adaptive, meta data-driven ETL programming.

Such an architectural shift will streamline the acquisition process for new trials and greatly simplify validation. By providing multi-trial data integration by design, this adaptive repository will also allow R&D companies to shorten the length of many clinical trials, even eliminating some altogether, greatly decreasing time-to-market in an industry where new products can generate millions of dollars of revenue per day.

R&D companies impose a threefold mission upon clinical data warehouses (CDWs): streamline data analysis and submission of trial results to regulatory agencies; provide trial progress metrics so trial managers can ensure milestones are met; and enable cross-trial data mining.

If trial designs were ever finalized, this mission might be achievable; however, clinical research evolves constantly. Figure 1 depicts the typical clinical trial process and highlights four common types of revised requirements that force IT to repeatedly reengineer even a single-trial data repository.


Figure 1: Typical Clinical Trial Process

For example, scientists frequently define additional data points as the study's hypothesis is refined. Additionally, clinical QA may revise core trial processes such as adverse event adjudication. Also, trial managers sometimes devise new metrics that better reveal trial progress. More urgently, the FDA often requests additional analyses that require new, derived variables.

Significant change occurs between trials as well, as study teams invent new case report form (CRF) elements and technological advances introduce new ways to measure product safety and efficacy. Figure 2 depicts the pervasive impact that any one of these changes can have on nearly every component of a CDW.


Figure 2: Pervasive Impact Change Has on CDW

With such erratic source data, low-maintenance, multi-trial clinical data warehouses are beyond the capabilities of traditional database techniques, yet life science firms must have such nimble CDWs to compete. Developing the target data structures and transformation scripts on a single-trial basis can take weeks, delaying the revenue stream that new products promise. Even for a moderately sized life science company, a CDW whose structures and code can be simply reused could generate more than $20 million in revenue by eliminating this delay.

Additionally, if the CDW held multiple trials with accurate business meta data, biostatisticians could aggregate representative patient populations from previous trials and substitute these aggregates for real patient data in regulatory submissions. At more than $3,000 per patient, reducing a Phase III trial population from 5,000 to 2,500 represents more than $7 million in savings on one trial alone.

Furthermore, data mining with a multi-trial CDW could generate statistical evidence of a product's superior safety or efficacy for specific subpopulations of patients, enabling its marketing department to better define medical niches to dominate, thus enhancing margins and prolonging product life.

FIRST-GENERATION CDWs FALL SHORT

Seeking such high potential ROI, clinical IT organizations have typically tackled the challenges of CDW through what can be called a first-generation (1G) approach. Born out of classic relational database techniques, these 1G repositories include a separate table for every CRF in the actual trial.

The disadvantages of this approach are many. First, without some form of target schema automation, every column in the source requires at least one handcrafted target column and a concomitant block of ETL code. Designers must also know of these required columns in advance, leading to a painstaking requirements gathering effort which greatly delays the repository's go-live date.

Second, because new CRF elements can be accommodated only with considerable effort once the trial commences, IT will be slow to respond and might even "push back" when study teams wish to redesign their forms. Third, the differences in target schemas for non-equivalent trials make cross-trial data integration difficult, impairing valuable data mining. Finally, so much labor is required to build and validate each trial's repository that IT teams forgo many of the advanced features CDWs should have, such as recording scientific meta data and tracking data subsets used in publications.
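The first two disadvantages can be illustrated with a minimal sketch of a 1G-style load, written here in Python against an in-memory SQLite database. The table name `ae`, its columns, the `relatedness` field and the sample records are all hypothetical and chosen only for illustration; they are not drawn from any actual trial or product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1G style: one handcrafted target table per CRF, one column per source field.
# (Table and column names are illustrative assumptions.)
conn.execute("CREATE TABLE ae (patient_id TEXT, event_term TEXT, severity TEXT)")


def load_ae(record: dict) -> None:
    """Hand-written ETL: every target column is named explicitly."""
    conn.execute(
        "INSERT INTO ae (patient_id, event_term, severity) VALUES (?, ?, ?)",
        (record["patient_id"], record["event_term"], record["severity"]),
    )


load_ae({"patient_id": "P-0042", "event_term": "headache", "severity": "mild"})

# A new CRF element introduced mid-trial forces a schema change...
conn.execute("ALTER TABLE ae ADD COLUMN relatedness TEXT")

# ...but load_ae still knows nothing about the new field, so the value is
# dropped until the ETL code is also edited and revalidated.
load_ae(
    {
        "patient_id": "P-0043",
        "event_term": "nausea",
        "severity": "moderate",
        "relatedness": "possible",
    }
)

columns = [row[1] for row in conn.execute("PRAGMA table_info(ae)")]
dropped = conn.execute(
    "SELECT relatedness FROM ae WHERE patient_id = 'P-0043'"
).fetchone()[0]
print(columns, dropped)
```

In this sketch a single new data point touches both the schema and the load routine, which is the per-column maintenance burden described above.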

LEAPING HURDLES WITH EAV

Given such disadvantages of repositories based upon directly representational data models, IT should refocus its efforts to base the CDW upon an abstraction of the trial data. Figure 3 presents how an "adverse events" (AE) record would appear once stored in such a meta model. Whereas in a traditional third normal form (3NF) model each attribute of the AE entity spawns a distinct column, in the meta model, each attribute has been pivoted down into a separate row. This technique is called "row modeling," and the resulting schemas are known as "EAV" models because their tables consist mostly of three columns: entity, attribute and value.


Figure 3: How an AE Record Appears When Stored in a Meta Model
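The pivot itself might be sketched in Python as follows. This is only an illustration under stated assumptions: the field names in `ae_record` are invented, and the `record_id` column is a hypothetical addition (not one of the three columns named above) used here to keep the rows of one source record together.

```python
from typing import Any, NamedTuple


class EavRow(NamedTuple):
    entity: str     # source table (CRF) the observation came from
    record_id: str  # hypothetical key tying rows of one source record together
    attribute: str  # source column name
    value: str      # observation, stored as text


def pivot_to_eav(entity: str, record_id: str, record: dict[str, Any]) -> list[EavRow]:
    """Pivot one wide source record into one EAV row per non-null attribute."""
    return [
        EavRow(entity, record_id, attribute, str(value))
        for attribute, value in record.items()
        if value is not None
    ]


# Hypothetical adverse-event record; the column names are illustrative only.
ae_record = {
    "patient_id": "P-0042",
    "event_term": "headache",
    "severity": "mild",
    "onset_date": "2004-03-17",
    "resolved_date": None,  # no value recorded, so no row is emitted
}

rows = pivot_to_eav("AE", "AE-1", ae_record)
for row in rows:
    print(row)
```

Because `pivot_to_eav` never names a source column, the same function would handle a record from any other CRF without modification.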

EAV schemas are in fact only a slight extension of the name-value pair paradigm used in many Internet applications today. With this in mind, the functions of the attribute and value columns are easy to understand. The entity column is added to record the source table (CRF) from which each observation originates. Accordingly, a trial with scores of separate source tables may very well land in an EAV repository with only a handful of target tables.
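As a rough sketch of how several source tables might share one target table, the Python example below loads rows from two hypothetical CRFs (`AE` and `VITALS`) into a single SQLite table. The table name `observation`, the `record_id` column and all sample values are assumptions made for illustration. The closing query, which pivots one entity back into a conventional wide row, is likewise just one possible way an analyst could read the data back out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation (
        entity    TEXT NOT NULL,  -- source table (CRF)
        record_id TEXT NOT NULL,  -- hypothetical key for one source record
        attribute TEXT NOT NULL,
        value     TEXT
    )
    """
)

# Rows from two different hypothetical CRFs land in the same target table.
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        ("AE", "AE-1", "event_term", "headache"),
        ("AE", "AE-1", "severity", "mild"),
        ("VITALS", "VS-1", "systolic_bp", "128"),
        ("VITALS", "VS-1", "heart_rate", "72"),
        # A data point added mid-trial arrives as a new row, not a new column.
        ("VITALS", "VS-1", "body_temp_c", "36.9"),
    ],
)

entities = [
    row[0]
    for row in conn.execute("SELECT DISTINCT entity FROM observation ORDER BY entity")
]

# Pivot one entity back into a conventional wide row for analysis.
wide = conn.execute(
    """
    SELECT record_id,
           MAX(CASE WHEN attribute = 'event_term' THEN value END) AS event_term,
           MAX(CASE WHEN attribute = 'severity'   THEN value END) AS severity
    FROM observation
    WHERE entity = 'AE'
    GROUP BY record_id
    """
).fetchall()
print(entities, wide)
```

Note that the new `body_temp_c` element required no change to the table definition or to the insert statement.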
