Tutorial 5: Horizontal and vertical medical data federation: Linking clinical and DICOM data using Semantic Web technologies

Johan van Soest, Tim Lustberg, M. Scott Marshall, and Andre Dekker

Maastricht University Medical Centre+,
Department of Radiation Oncology (MAASTRO), GROW School for Oncology and
Developmental Biology,
Maastricht, The Netherlands
http://www.maastro.nl

Tutorial: https://github.com/jvsoest/Data-Integration-Tutorial/wiki

Abstract

Clinical data is widely available in hospitals, but it is isolated in source systems. This limits secondary use of clinical data, as the data is not findable, accessible, interoperable and reusable (FAIR). We propose to overcome these issues using Semantic Web technologies. In this tutorial, participants will learn about the problems of working with clinical data and the need for linked data. We will present the example of radiotherapy data and guide participants through making this data FAIR, thereby promoting secondary use of clinical data and translational research.

Call for Participation
Clinical (healthcare) data is widely available in hospitals, but it is hard to standardize and integrate across the different systems used by different departments (e.g. electronic medical records, radiology information systems, oncology information systems, radiotherapy treatment planning systems, laboratory information systems, etc.). Each vendor and system has its own view of the specific (sub-)domain of data it represents, and has therefore developed its own standard for data modeling and representation, creating so-called isolated information silos.

These information silos limit secondary use of clinical data, as bits and pieces of data need to be integrated before researchers can investigate multiple data elements originating from different sources. As a consequence, most of a researcher’s time is invested in data collection, merging and cleaning, rather than in the actual research and the synthesis of new knowledge. This runs contrary to the more recent principle that research data should be FAIR: Findable, Accessible, Interoperable and Reusable, meaning that data should have explicit semantics for both its structure and its actual data values [4].

In recent years, most hospitals have invested in data warehouse (DWH) techniques to bridge this gap between primary (clinical) and secondary (e.g. research, financial, operational) use of data [1]. Although these initiatives solve the immediate issues for researchers by integrating data from different hospital sources (federation of vertically partitioned data), they do not make the data FAIR, which limits the exchange of research data between hospitals.

To overcome these issues, we propose the use of Semantic Web technologies, specifically the Resource Description Framework (RDF), the SPARQL query language, and RDF(S)- and OWL-based ontologies. The aim of this tutorial is therefore to present and disseminate our solution for integrating data from multiple source systems in a radiotherapy context, which implicitly also enables federation of horizontally partitioned data (partitioning between hospitals).

Motivation
The flexibility of the Resource Description Framework and SPARQL queries enables federation of both horizontally and vertically partitioned data. This federation can be achieved on-write (when filling the RDF store) or on-read (defined in the SPARQL queries). The latter option will be discussed in this tutorial, as it is the most interesting option for ad-hoc exploratory analysis of data. In order to represent concepts important to research in radiation oncology, we developed the Radiation Oncology Ontology (ROO; see http://bioportal.bioontology.org/ontologies/ROO), which reuses existing ontologies and links concepts from different ontologies. Furthermore, this ontology gives guidance on the representation of the actual patient data by adding predicates that link instances of classes (e.g. instance_of_patient hasDisease instance_of_disease).
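
To make the distinction between on-write and on-read federation concrete, the following minimal sketch uses plain Python tuples as stand-ins for RDF triples. It is not the infrastructure used in the tutorial: the `ex:` identifiers, the `hasImage` predicate and the helper function names are hypothetical, and only `hasDisease` is taken from the example above.

```python
# Plain-Python stand-in for RDF triples: (subject, predicate, object).
# All identifiers below are hypothetical illustration values.

# Vertically partitioned: two source systems describe the same patient.
clinical_a = {("ex:patient1", "hasDisease", "ex:lung_cancer")}
imaging_a = {("ex:patient1", "hasImage", "ex:ct_scan_1")}

# Horizontally partitioned: another hospital holds different patients.
clinical_b = {("ex:patient2", "hasDisease", "ex:lung_cancer")}


def match(triples, predicate, obj=None):
    """Return the subjects that have `predicate` (and `obj`, if given)."""
    return {
        s for (s, p, o) in triples
        if p == predicate and (obj is None or o == obj)
    }


def federate_on_write(*sources):
    """On-write: copy every source into one merged store at load time."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def federate_on_read(sources, predicate, obj=None):
    """On-read: leave the sources in place and combine results per query."""
    results = set()
    for source in sources:
        results |= match(source, predicate, obj)
    return results


# On-write: a single merged store answers the query.
store = federate_on_write(clinical_a, imaging_a, clinical_b)
patients_on_write = match(store, "hasDisease", "ex:lung_cancer")

# On-read: the same question, asked of each source at query time.
patients_on_read = federate_on_read(
    [clinical_a, imaging_a, clinical_b], "hasDisease", "ex:lung_cancer"
)

# Vertical join on read: lung cancer patients who also have an image.
with_image = patients_on_read & federate_on_read([imaging_a], "hasImage")
```

Both approaches return the same patients here. The practical difference is where the merging happens: on-write requires loading everything into one store first, whereas on-read leaves each source untouched until a question is asked.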

Using this ontology, we have converted data from cancer patients into RDF to achieve federation of horizontally partitioned data (see www.eurocat.info/community.html). In a different project, we have developed an ontology to represent the medical imaging file standard (DICOM), and developed a tool to convert the metadata of these DICOM files into RDF [3]. Furthermore, we developed an additional open-source tool to extract image-derived and radiotherapy-specific information from these files (see https://bitbucket.org/account/user/maastrosdt/projects/MIA). The extracted results were in turn stored in an RDF data store (see http://ebooks.iospress.nl/publication/37469).

In this tutorial, we want to present the infrastructure for extracting data from these different, clinically used data sources and for performing federation at query execution. We think this is interesting for SWAT4LS for several reasons:

Semantic Web experts will have the opportunity to gain more insight into the problems with clinical data, while being able to suggest new or better approaches from their own background.
Clinical (and radiotherapy) experts can see how Semantic Web techniques can help to cope with horizontally and vertically partitioned data.
The tutorial shows the possibilities of linked data for both horizontally and vertically partitioned data.

Organisers

Johan van Soest, MSc: the presenter of this tutorial; author of the CMO for lung cancer data and active project member of the SeDI project and the image extraction pipeline. He was also the first author of a book chapter on machine learning in a multicenter setting for radiation oncology [2]. As a PhD student, he is involved in distributed learning (not sharing data, but sharing algorithms) and is administrator of a national biomedical imaging archive.

Tim Lustberg: co-presenter of this tutorial; software development lead on the image analysis pipeline, with experience in the extraction and translation of data from clinical systems into RDF. As a PhD student, he is involved in infrastructure, extraction and the application of machine learning to optimize clinical workflows.

Prof. Dr. Andre Dekker: Professor of clinical data science, and main author of the Radiation Oncology Ontology (ROO). This ontology is used to link concepts from different ontologies and to enrich these concepts by defining relationships between them. He is Principal Investigator on several global projects on distributed machine learning (not sharing data, but sharing algorithms), which rely heavily on standardized terminologies and flexible infrastructures. He has experience in teaching courses for medical students and is a well-known speaker in his field around the world.

Detailed Description

Aims
At the end of the tutorial, participants will:

Understand the problem with clinical research data
Understand the issue with DICOM data and its discrepancy between clinical and research use
Have built database-to-RDF scripts to extract clinical data
Understand how DICOM metadata from clinical images can be extracted
Understand how image-derived (radiotherapy) data can be extracted from DICOM data
Have built a query federating over multiple sources to retrieve clinical and imaging data

Overview of Content
During the tutorial, participants will identify the clinical variables needed in a sample dataset and build database-to-RDF scripts using D2R to convert this clinical data into RDF. Furthermore, we will show how DICOM metadata, as well as information derived from the images themselves, can be converted into RDF. Finally, participants will learn how the data from these sources is linked together, enabling federated queries across all three sources.
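
As a rough illustration of what a database-to-RDF conversion does, the sketch below turns rows of a relational table into N-Triples lines using only the Python standard library. It does not use D2R and is not the mapping used in the tutorial: the table layout, the `http://example.org/` URIs and the predicate names are assumptions made for this example, and values are emitted as plain literals for brevity rather than as instances of ontology classes.

```python
import sqlite3

# Hypothetical base URI and predicate URIs; not taken from the ROO.
BASE = "http://example.org/"

# Hypothetical mapping from column name to predicate URI, playing the
# role that a mapping file plays in a database-to-RDF tool.
COLUMN_TO_PREDICATE = {
    "gender": BASE + "hasGender",
    "diagnosis": BASE + "hasDisease",
}


def rows_to_ntriples(connection):
    """Convert each row of the `patient` table into N-Triples lines.

    Values are assumed to contain no quotes or backslashes, so no
    literal escaping is performed in this sketch.
    """
    lines = []
    cursor = connection.execute(
        "SELECT id, gender, diagnosis FROM patient ORDER BY id"
    )
    for patient_id, gender, diagnosis in cursor:
        subject = f"<{BASE}patient/{patient_id}>"
        values = {"gender": gender, "diagnosis": diagnosis}
        for column, value in values.items():
            if value is None:
                continue  # a missing value produces no triple
            predicate = f"<{COLUMN_TO_PREDICATE[column]}>"
            lines.append(f'{subject} {predicate} "{value}" .')
    return lines


# Hypothetical stand-in for an EMR table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, gender TEXT, diagnosis TEXT)"
)
connection.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, "female", "lung cancer"), (2, "male", None)],
)

triples = rows_to_ntriples(connection)
for line in triples:
    print(line)
```

The point of the sketch is that each row becomes a subject URI and each non-empty column becomes one triple. A dedicated mapping tool adds what is missing here, such as typed literals, links to ontology classes and access to the live database.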

The list of contents is as follows:

Introduction to the tutorial; participants introduce themselves.
Motivation and introduction to the problem with clinical data and DICOM
Creating/editing D2R scripts (http://d2rq.org) to extract data from an EMR example
Showing how DICOM metadata is converted to RDF using a commercial product
Showing how DICOM image feature extraction works using our open-source MIA infrastructure
Building a federated query over all three data sources

These items will form the main structure of the tutorial. During the tutorial we expect questions and discussions about data modeling and infrastructure setup. Furthermore, we will compare this approach to available DWH techniques, specifically addressing the flexibility of Semantic Web technologies for data federation. Finally, we will discuss how this approach could work in a distributed learning environment, where researchers do not have access to the original data sources, but only have permission to launch an algorithm in an (external) hospital infrastructure.
This tutorial will mostly be of practical use to Semantic Web specialists with an interest in healthcare data, and to healthcare IT employees interested in the practical use of Semantic Web technologies.
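
The last item in the list of contents, a federated query over all three data sources, could take roughly the following shape. The sketch builds a SPARQL 1.1 query string that uses the `SERVICE` keyword to address one endpoint per source. The endpoint URLs, the `ex:` prefix and the predicates `ex:hasStudy` and `ex:hasTumourVolume` are hypothetical. It also assumes that patients and studies carry the same URIs in every endpoint, which in practice has to be arranged when the data is converted.

```python
# Hypothetical endpoint URLs and predicate names, for illustration only.
ENDPOINTS = {
    "clinical": "http://example.org/clinical/sparql",
    "dicom": "http://example.org/dicom/sparql",
    "features": "http://example.org/features/sparql",
}


def build_federated_query(endpoints):
    """Build a SPARQL 1.1 query that joins three endpoints on read."""
    return f"""PREFIX ex: <http://example.org/>
SELECT ?patient ?disease ?study ?volume
WHERE {{
  SERVICE <{endpoints['clinical']}> {{
    ?patient ex:hasDisease ?disease .
  }}
  SERVICE <{endpoints['dicom']}> {{
    ?patient ex:hasStudy ?study .
  }}
  SERVICE <{endpoints['features']}> {{
    ?study ex:hasTumourVolume ?volume .
  }}
}}"""


query = build_federated_query(ENDPOINTS)
print(query)
```

The shared variables `?patient` and `?study` perform the join across sources at query time, so no merged store has to be built beforehand.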

References
1. E. Roelofs, L. Persoon, S. Nijsten, W. Wiessler, A. Dekker, and P. Lambin. Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial. Radiotherapy and Oncology, 108(1):174–179, July 2013.
2. J. van Soest, A. Dekker, E. Roelofs, and G. Nalbantov. Application of Machine Learning for Multicenter Learning. In I. El Naqa, R. Li, and M. J. Murphy, editors, Machine Learning in Radiation Oncology, pages 71–97. Springer International Publishing, 2015.
3. J. van Soest, T. Lustberg, D. Grittner, M. S. Marshall, L. Persoon, B. Nijsten, P. Feltens, and A. Dekker. Towards a semantic PACS: Using Semantic Web technology to represent imaging data. Studies in Health Technology and Informatics, 205:166–170, 2014.
4. M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. 't Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3:160018, Mar. 2016.