UMIT Advances Cancer Research with Open Source Data Integration

  • April 11 2008, 10:26am EDT

REVIEWER: Bernhard Pfeifer, associate professor at the University for Health Sciences, Medical Informatics and Technology (UMIT), Institute of Biomedical Engineering.

BACKGROUND: UMIT, based in Hall, Austria, is a key participant in the advancement of prostate cancer research. At the heart of our initiative (project name: IMGuS), we apply high-throughput data processing to patient samples to spot molecular signatures and identify patients who may have more curable forms of cancer and would likely be more receptive to treatment.

PLATFORMS: The data warehouse and its extract, transform and load (ETL) jobs run on HP DL320 servers with SUSE Linux Enterprise Edition 9.1 and on Apple Xserve servers with OS X 10.4.

PROBLEM SOLVED: A large part of cancer research today consists of data processing and statistical analysis. While a multitude of data is published and available, there are no governing standards for tracking information relating to cancer research. Because findings are collected and reported with different software applications and databases, analyzing them across sources is an enormous challenge. Through project IMGuS, UMIT is taking the lead in aggregating research data relating to prostate cancer. Talend’s data integration software, Talend Open Studio, has made it possible for UMIT to quickly, affordably and accurately pool information from disparate databases into one source that is available to anyone through a single Web portal.

PRODUCT FUNCTIONALITY: Talend Open Studio brings UMIT’s prostate cancer research project many key advantages. The PostgreSQL-based LINDA data warehouse, which houses data used for the statistical analysis, is loaded in two stages. The “Electronic Data Capture” stage centralizes data from the different sources: patient samples, reference medical data, genome cartography, etc. The separate data sources are very diverse, with varying formats. Administrative data is loaded at this stage. The second loading stage reconciles, transforms, cleanses and enriches the data aggregated during the first capture stage, and loads the LINDA data warehouse. At this point, UMIT introduces the reference data from additional resources. Through both stages, Talend’s native support of Web services and XML brings tremendous value to the project. It enables us to very easily cross-reference external sources of data, significantly reducing the time it would otherwise take to populate the data warehouse. The frequent refresh of the data warehouse, performed nightly, ensures that researchers can use ad hoc query and data mining tools, and apply advanced statistical models to extract the most up-to-date data relevant for their research.
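The two loading stages described above can be sketched in a few lines of Python. This is only an illustration of the capture-then-transform pattern, not UMIT's Talend jobs: the source formats, field names, threshold value and table layout are all invented for the example, and an in-memory SQLite database stands in for the PostgreSQL-based warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extracts; formats and field names are invented for illustration.
SAMPLES_CSV = "patient_id;marker_level\nP001; 3.1\nP002;\nP003;11.0\n"
CLINIC_ROWS = [{"PatientID": "p001", "Age": "63"}, {"PatientID": "P003", "Age": "58"}]
# Stand-in for reference data that would come from an external resource (placeholder value).
REFERENCE = {"marker_threshold": 4.0}


def capture():
    """Stage 1: pull every source into one staging shape (patient_id, field, raw value)."""
    staging = []
    for row in csv.DictReader(io.StringIO(SAMPLES_CSV), delimiter=";"):
        staging.append((row["patient_id"], "marker_level", row["marker_level"]))
    for row in CLINIC_ROWS:
        staging.append((row["PatientID"], "age", row["Age"]))
    return staging


def transform_and_load(staging, conn):
    """Stage 2: reconcile keys, cleanse values, enrich with reference data, then load."""
    patients = {}
    for patient_id, field, raw in staging:
        value = raw.strip()
        if not value:  # cleanse: drop empty measurements
            continue
        key = patient_id.strip().upper()  # reconcile: one key format across sources
        patients.setdefault(key, {})[field] = float(value)
    conn.execute(
        "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, age REAL, "
        "marker_level REAL, above_threshold INTEGER)"
    )
    for key, fields in sorted(patients.items()):
        level = fields.get("marker_level")
        # enrich: flag each patient against the reference threshold
        flag = None if level is None else int(level > REFERENCE["marker_threshold"])
        conn.execute(
            "INSERT INTO patient VALUES (?, ?, ?, ?)",
            (key, fields.get("age"), level, flag),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse connection
transform_and_load(capture(), conn)
rows = conn.execute(
    "SELECT patient_id, age, marker_level, above_threshold FROM patient ORDER BY patient_id"
).fetchall()
print(rows)
```

Keeping the capture stage free of business rules means a new source only needs a new reader that emits the staging shape; reconciliation and enrichment stay in one place in the second stage.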

STRENGTHS: The UMIT project IT team downloaded Talend Open Studio at no license cost. It immediately provided the default connectors needed to migrate and then integrate data held in different software applications and formats, consolidating five research database sources into a single source. This greatly decreased the amount of time required to conduct research. Talend’s ability to integrate Web services is critical. Because the tool is written in Java, its source code is easy to customize.

WEAKNESSES: Much of the exception handling has to be managed explicitly, case by case. The tool could use some improvements in this area to make it more user-friendly to define how error conditions are processed. We are having ongoing discussions with the community on this topic and expect these discussions to result in a product enhancement.

SELECTION CRITERIA: UMIT explored a number of data integration solutions, both proprietary and open source, and settled on Talend’s solutions because of their flexibility, openness and high performance. It was critical that the chosen solution not only work with all data sources, but be able to integrate specific data processing approaches. Talend’s open architecture allowed UMIT to develop specific components to access and process this data.

DELIVERABLES: UMIT relies entirely on Talend’s solutions for all data integration needs. The timeliness and accuracy of the data housed in the resulting database enable researchers to run ad hoc queries on current data and to apply advanced statistical models to advance prostate cancer research.

VENDOR SUPPORT: We have interacted extensively with the community, in which Talend’s R&D team is heavily involved, through its online tools: forum, wiki and Bugtracker. Talend’s local team in the German-speaking region took a vested interest in our project and was a big help to us.

DOCUMENTATION: With the download of the Talend Open Studio software, we received a comprehensive set of documentation and tutorials that provided the guidance we needed to initiate and execute our project.

Talend Open Studio
105 Fremont Ave., Suite F
Los Altos, CA 94022
(650) 396-7738
