Semantic Web in the Pharmaceutical Industry: A Landscape of Challenges and Opportunities
Presented by: Tim Williams (UCB)
Semantic Web technology is slowly gaining traction in the traditionally risk-averse pharmaceutical industry. In recent years, CDISC (www.cdisc.org) has led the charge for standards development and implementation across the data lifecycle, from pre-clinical animal studies to submission of results from clinical trials. PhUSE (www.phuse.eu) is facilitating cooperation between companies in the pre-competitive space, promoting innovation, and coordinating with regulatory agencies. However, many challenges remain. The industry continues to rely on a thirty-year-old file format (SAS transport version 5 XPT) for regulatory submissions. Most standards are published in the traditional row-by-column format, with multiple documents to support their implementation. Converting data between versions of standards is resource-intensive and costly. The industry also needs to integrate new types of data sources like Real World Evidence (insurance claims data, electronic health records, etc.) and Risk-Based Monitoring of clinical trials.
This session will provide an interactive discussion on the potential for implementing semantic technologies to solve the many challenges facing the industry. An overview of the data landscape will be provided, including the industry standards, adoption of the Semantic Web in pharma, and proposed approaches and initiatives to modernize using Linked Data. The Semantic Web can be employed to solve problems as diverse as data integration, standards management, regulatory submissions, and publication. Success will only be possible when barriers to communication and cooperation are overcome and Pharma openly participates in and contributes to the wider Semantic Web community.
The OMOP common data model for observational studies
Presented by: Michel Van Speybroeck (Janssen)
The OMOP common data model (https://www.ohdsi.org/) has become a standard for the analysis of disparate observational databases and is currently being used in different European projects. In this workshop we’re going to explore the OMOP structure and look at the different data standards that are applied. Attendees will have the chance to get some hands-on experience with the data model. The session will be concluded with a review of the existing open source tools that can be used to support the systematic analysis of the data.
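To give a flavour of hands-on work with the data model, here is a minimal sketch that builds two heavily simplified OMOP-style tables in an in-memory SQLite database and counts distinct persons per condition concept. The table and column names follow the OMOP CDM (`person`, `condition_occurrence`), but only a small subset of columns is included, and all rows and concept IDs are invented for illustration. The helper names `build_demo_cdm` and `persons_per_condition` are hypothetical and not part of any OHDSI tool.

```python
import sqlite3


def build_demo_cdm():
    """Create a tiny in-memory database with two simplified OMOP-style tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            gender_concept_id INTEGER,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER REFERENCES person(person_id),
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
        """
    )
    # Illustrative rows only; the concept IDs are placeholders for this sketch.
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, 8507, 1960), (2, 8532, 1975), (3, 8532, 1982)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (10, 1, 201826, "2015-03-01"),
            (11, 2, 201826, "2016-07-15"),
            (12, 2, 201826, "2017-01-20"),  # same person, repeated condition
            (13, 3, 317009, "2018-02-02"),
        ],
    )
    return conn


def persons_per_condition(conn):
    """Count distinct persons for each condition concept."""
    rows = conn.execute(
        """
        SELECT condition_concept_id, COUNT(DISTINCT person_id)
        FROM condition_occurrence
        GROUP BY condition_concept_id
        ORDER BY condition_concept_id
        """
    ).fetchall()
    return dict(rows)


counts = persons_per_condition(build_demo_cdm())
```

Because every source database is mapped to the same tables and concept identifiers, a query like this can in principle be run unchanged against different observational databases.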
HL7 FHIR and the Semantic Web
Presented by: Harold Solbrig (Johns Hopkins University School of Medicine)
HL7’s Fast Healthcare Interoperability Resources (FHIR®) is emerging as a next generation standards framework for healthcare related data exchange. FHIR-based solutions are built from a set of modular components (“Resources”) which can be assembled into working systems. FHIR is available in a variety of contexts including mobile phone apps, cloud communications, EHR-based data sharing, and system to system communication between and across healthcare providers.
This tutorial builds on the 2016 and 2017 tutorials on the same subject. It will describe how to access the FHIR technology stack, including resource definitions, API specifications and conformance profiles. In this tutorial we will construct, validate and query a simple FHIR resource instance using a FHIR testing server. It will also provide updates on the current state of the FHIR specification and potential future directions.
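The construct, validate and query steps can be sketched offline as follows. This is a minimal illustration, not the tutorial's material: the base URL is a hypothetical placeholder, the helper names are invented, and `basic_checks` performs only two local sanity checks. Real validation would be done against the FHIR specification, for example through a server's `$validate` operation.

```python
from urllib.parse import urlencode

# Administrative gender codes defined for the FHIR Patient resource.
ALLOWED_GENDERS = {"male", "female", "other", "unknown"}


def make_patient(family, given, gender, birth_date):
    """Build a minimal FHIR Patient resource as a JSON-compatible dict."""
    return {
        "resourceType": "Patient",
        "name": [{"family": family, "given": [given]}],
        "gender": gender,
        "birthDate": birth_date,  # expected as YYYY-MM-DD
    }


def basic_checks(resource):
    """Return a list of problems found by two simple local checks.

    This is NOT full FHIR validation; it only illustrates the idea.
    """
    problems = []
    if resource.get("resourceType") != "Patient":
        problems.append("resourceType must be 'Patient'")
    if resource.get("gender") not in ALLOWED_GENDERS:
        problems.append("gender must be one of " + ", ".join(sorted(ALLOWED_GENDERS)))
    return problems


def search_url(base_url, **params):
    """Build a RESTful search URL of the form [base]/Patient?name=value."""
    return f"{base_url.rstrip('/')}/Patient?{urlencode(params)}"


patient = make_patient("Doe", "Jane", "female", "1970-01-01")
problems = basic_checks(patient)
# Hypothetical test-server base URL, used here only to show the URL shape.
url = search_url("https://fhir.example.org/baseR4/", family="Doe")
```

On a real test server, the resource would be sent as JSON with an HTTP POST to `[base]/Patient`, and the search URL would be fetched with an HTTP GET.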
Knowledge representation for computational phenotyping
Presented by: Tudor Groza (the Monarch Initiative)
Phenotype has long been used to determine the likely underlying genetic aetiology of rare disease in patients, as well as to substantially reduce the search space for genomic variation. The analysis of phenotypic abnormalities provides a translational bridge from genome-scale biology to a patient-centred view of human disease pathogenesis. Detailed phenotype data, combined with increasing amounts of genomic data, have enormous potential to accelerate the identification of disease aetiology, facilitate disorder stratification, inform prognosis and improve the understanding of health and disease. This tutorial covers the current de facto standards developed by the Monarch Initiative to represent the knowledge underpinning computational phenotyping, ranging from the Human Phenotype Ontology to cross-species phenotype integration and the Phenopackets exchange format.
openEHR – the „open platform“ revolution
openEHR addresses the heterogeneity of medical IT systems, and the resulting dispersion of medical data, by standardizing clinical content in archetypes and templates. These archetypes are developed in collaboration with the international openEHR community, which consists of physicians, computer scientists, and data processing and standardization experts. The results are systems and tools that add value to decision support in day-to-day clinical practice and help to answer research questions. These tools will reduce workload and improve semantic traceability. Furthermore, model-generated code and user interfaces are an area of continuous innovation in openEHR and promise to revolutionize health computing.
Agricultural semantics: challenges, opportunities and the AgroFIMS fieldbook
Unlike the biomedical domain, the agricultural domain lags in adopting semantic technologies. However, the tremendous amount of data produced by a wide variety of actors, from farmers to researchers to private companies, will benefit from the gains in interoperability and semantic richness conferred by these technologies. Indeed, unifying access to this data and linking it with data from other domains can unlock powerful analytical capabilities and help to accelerate innovations that address global food security challenges. Several semantic standards and tools currently exist in the agricultural domain, but the overlap in their content makes them difficult to reuse.
This tutorial is designed for anyone keen to learn about new use cases in the agrisemantics domain, to enhance skills that could be applied towards meaningful outcomes in leveraging the agricultural data ecosystem more effectively.
The tutorial will be organized in four parts. First, we will introduce the agricultural domain and its challenges in terms of the data landscape and the inherently sensitive nature of some data. A basic introduction to ontologies and Semantic Web technologies will also be part of this first session. Then, an in-depth overview of the existing semantic standards and tools used in the agri-community will be presented. Next, we will take an example of how CGIAR is using semantic technologies to harmonize agronomic data from data collection to publishing. We will present the AgroFIMS fieldbook builder, which relies on the Agronomy Ontology, and show how the data produced are semantically annotated from the start and can therefore be easily published as Linked Open Data. We will also show how these data may be queried using SPARQL and presented in a user-friendly interface. Finally, the tutorial will support an interactive discussion on how the SWAT4HCLS community can become engaged in the Agrisemantics domain.
Semantics in action
LOD4CG & BioFed: Publishing and Accessing Large-Scale Cancer and Biomedical Linked Data Resources
The Biomedical and Life Sciences domains have been early adopters of Linked Data and Semantic Web technologies, and at present a considerable portion of the Linked Open Data cloud consists of Life Sciences datasets, known as LS-LOD. Although the publication of datasets as RDF is a necessary step towards achieving unified querying of biological datasets, it is not enough to achieve the interoperability necessary to enable a queryable Web of Life Sciences data. This tutorial is designed for researchers and practitioners in the biomedical and life science domains who want to publish and access relevant datasets over the Web in order to find meaningful biological correlations.
This tutorial will consist of four separate sessions, targeting an audience ranging from limited to advanced knowledge of the Web technologies used to curate, access and consume biomedical data. First, we'll introduce the basic Web technologies (RDF, XML, SPARQL) required for publishing and accessing the data. Secondly, we'll present a thorough landscape of available biomedical semantic resources, comparing their usage, size, coverage and access mechanisms.
Then, we'll present data access mechanisms, specifically SPARQL query federation over the Web, for accessing relevant data at large scale. Lastly, we will present the complete data publication life cycle used to build the Linked Open Data for Cancer Genomics (LOD4CG). The LOD4CG life cycle involves a step-by-step process of data curation, linking, terminology alignment, visualization, etc., for establishing links within LOD4CG and across the larger LOD cloud. LOD4CG is supported by an intuitive interface to explore and visualize links among the TCGA, COSMIC, REACTOME, KEGG, and GO datasets.
Background: The material and discussion presented in this tutorial are based on the various methods (data curation, linking, querying, etc.) that we developed within the scope of the BIOOPENER project.
Prerequisite: The tutorial assumes – but does not require – a basic knowledge of RDF, SPARQL, OWL, and fundamentals of biomedical datasets.
Bioschemas, a lightweight semantic layer for Life Sciences
Websites in Life Sciences are commonly used to expose data, offering search, filter, and download capabilities so users can easily find, organize and obtain data relevant to them and their research field or interest. With the continuous growth of Life Sciences data, it gets more and more difficult for users to find all the information required for their research on one single website. Effective, and sometimes tailored, search engines are therefore a key resource for researchers.
Schema.org is a collaborative community effort providing schemas so that data exposed on web pages can be semantically structured, making it easier to determine whether a web page refers to a book, a movie or a protein. As it becomes a shared layer across many pages, it also makes it easier to interoperate with other websites. Bioschemas is a community project built on top of schema.org, aiming to provide specialized entities commonly used in the Life Sciences domain. Bioschemas reuses not only what already exists in schema.org but also what is available in well-known ontologies and controlled vocabularies.
The Bioschemas strategy is a compromise between simplicity and the FAIR principles: rather than trying to expose all possible data encompassed by your resources, go for the minimum that will boost findability, accessibility and interoperability. Two main steps are required: (i) from the large number of types and properties supported by schema.org, select those most relevant in Life Sciences for findability and summarization of data catalogs as well as their datasets and records; (ii) for common Life Science entities, define a flexible wrapper that can later be profiled according to specific needs. A Bioschemas profile acts as a pseudo-type including guidelines about marginality, cardinality and reuse of controlled vocabularies. In this tutorial, you will learn how to use Bioschemas profiles in order to add a lightweight semantic layer to your biomedical websites.
We will guide you through the Bioschemas profiles for data repositories, datasets, data records and elements related to proteins, genes, samples, lab protocols and so on, showing existing and potential ways to take advantage of the Bioschemas markup.
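As an illustration of what such a lightweight semantic layer can look like, the sketch below generates schema.org `Dataset` markup as JSON-LD and wraps it in a script tag for embedding in a web page. The selection of properties is an assumption made for this example and should be checked against the relevant Bioschemas profile; the dataset values, URLs and helper names are invented.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords, license_url):
    """Describe a dataset with a small set of schema.org properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
        "license": license_url,
    }


def as_script_tag(document):
    """Wrap a JSON-LD document in the script element used to embed it in HTML."""
    body = json.dumps(document, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset; every value here is a placeholder.
markup = dataset_jsonld(
    name="Example protein interaction dataset",
    description="Illustrative record used to show the markup shape.",
    url="https://data.example.org/datasets/42",
    identifier="https://data.example.org/datasets/42",
    keywords=["protein", "interaction"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
tag = as_script_tag(markup)
```

Placing the resulting tag in the page's HTML lets search engines and other harvesters read the structured description without changing what human visitors see.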
The CEDAR Workbench: Enhanced Authoring of Standards-Based Metadata for use on the Semantic Web
Presented by: Mark Musen (Stanford)
The Center for Expanded Data Annotation and Retrieval (CEDAR) works to enhance the authoring of metadata to make online datasets more useful to the scientific community. In this tutorial, participants will learn to use the CEDAR Workbench to manage a library of templates for representing experimental metadata and to fill in those templates in a structured manner, using terms from standard ontologies managed by the BioPortal resource. CEDAR facilitates uploading the user’s metadata (and associated data sets) to online repositories. By ensuring that the metadata conform with community-derived standards, and that the metadata use controlled terms whenever possible, the CEDAR Workbench helps to make online datasets FAIR.
Knowledge Discovery and Machine Learning in KNIME
(An introduction to Pattern Mining, Clustering and Classification)
The Healthcare and Life Sciences industries are set for a disruptive transformation as an increasing amount of data (genomics, sensor data, RWE) is coupled with advances in methods and technologies to leverage such data (e.g., machine learning).
Data is being generated and stored continuously at a growing pace in a variety of industries and situations: think of operational data, customer data, web logs, monitoring data, etc. Such data contains a great deal of valuable information that, if accessed and analyzed correctly, can yield novel insights (e.g., discovering users with similar browsing behavior), can be used to automate decisions, and more.
With knowledge discovery and machine learning, we aim to discover and extract novel information and create models from historic data that can later be applied on their own or to new data. For instance, using different techniques, people have been able to discover single patterns indicating that the buying behavior for a certain product closely relates to the presence (or absence) of a specific condition (e.g., time of the year, availability of another product, etc.). Alternatively, using historic labeled data on movies and their genres, we can build models that predict the genre of new movies based on their cast or the words in their title and short summary (i.e., the models classify the movie genre). In yet another setting, we might be interested in discovering clusters of customers with similar buying behavior that can be used for targeted advertising.
Essentially, the knowledge discovery and machine learning process boils down to integrating all relevant data, pre-processing the data to solve all real-world data related problems (such as inconsistencies, missing data, etc.), running algorithms to build models and interpreting the results (for instance using visualizations to improve the comprehensibility). In a typical setup, this process is streamlined and an analyst can repeat this process multiple times in an interactive setting to optimize the resulting models.
In this tutorial we will cover some of the basics of knowledge discovery and machine learning, and present different techniques to extract knowledge from data as well as techniques to build more complex models. We focus on models and techniques that have found simple and broad usage in a variety of industries. Finally, we show how to use these techniques in a comprehensive demo in KNIME, an interactive tool for knowledge discovery and machine learning.
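The pre-process, build and apply steps of this process can be sketched in a few lines of plain Python. This is not a KNIME workflow and not the tutorial's demo: it is an assumed toy example that fills in missing values with the column mean and then classifies new points with a one-nearest-neighbour rule. The data and function names are invented for illustration.

```python
def impute_missing(rows):
    """Pre-processing: replace None with the mean of the observed column values."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]


def nearest_neighbour(train_x, train_y, query):
    """Classification: return the label of the closest training example."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i], query))
    return train_y[best]


# Toy historic data with one missing value (None); labels are made up.
features = [[1.0, 1.0], [1.2, None], [8.0, 9.0], [9.0, 8.0]]
labels = ["low", "low", "high", "high"]

clean = impute_missing(features)
prediction_a = nearest_neighbour(clean, labels, [1.1, 0.9])
prediction_b = nearest_neighbour(clean, labels, [8.5, 8.5])
```

In an interactive tool, each of these functions would correspond to a step that the analyst can re-run and adjust while inspecting the intermediate results.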
Data2services, converting your data to a standard data model
Today an increasing amount of data is available on the Web, but this data usually comes in a myriad of formats (XML, CSV, RDB…) with no inherent semantic representation. Inspired by the FAIR data principles (Findable, Accessible, Interoperable, Reusable), we propose Data2services, a framework built with scalability in mind to convert any type of data to a standard data model that follows the Semantic Web standards.
This data is then accessible as RDF through a SPARQL endpoint. We are also working on automatically generating web services based on the data model for simplified access.
The Semantic Web standards have been chosen over other information models for a variety of reasons:
- Standards collaboratively defined and adopted by the W3C
- An active and mature ecosystem (RDF, OWL, SPARQL)
- Can be applied to all domains (not only BioMedical data)
- Strong expressivity to define the relevant data model
- Machine readable
In the hands-on part, participants will execute the Data2services pipeline on their own machine using Docker to convert their data to a generic RDF representation, in which the data representation is based on the structure of the input data.
Participants will then craft SPARQL queries to map the relevant data from the generated generic RDF to the target data model.
Prerequisites:
- Docker already installed, or admin rights on your machine to install Docker
- Data you want to transform (relational database and/or XML, CSV, TSV, or PSV files)
- A target data model to convert your data to; otherwise we can propose one
Expected outcomes:
- Data2services running on your hardware using Docker
- Your data complying with a standard data model, accessible from a SPARQL endpoint and an auto-generated API
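The two-step idea described above (first a generic RDF representation that mirrors the input structure, then a SPARQL mapping to the target model) can be sketched as follows. This is not the Data2services implementation: the namespace, the URI layout, the function names and the example target model are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/generic/"  # hypothetical namespace for the generic RDF


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_ntriples(text, table_name):
    """Turn each non-empty CSV cell into one triple: row, column, value."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{BASE}{table_name}/row/{row_number}>"
        for column, value in row.items():
            if value is None or value == "":
                continue  # skip empty cells rather than emit empty literals
            predicate = f"<{BASE}{table_name}/column/{quote(column)}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Illustrative mapping from the generic RDF to a target model (schema.org here).
MAPPING_QUERY = """
PREFIX schema: <https://schema.org/>
CONSTRUCT { ?row a schema:Drug ; schema:name ?name . }
WHERE { ?row <http://example.org/generic/drugs/column/name> ?name . }
"""

triples = csv_to_ntriples("id,name\n1,aspirin\n2,\n", "drugs")
```

The generic triples carry no domain semantics of their own; the meaning is added by the mapping query, which is the part the user writes for a chosen target data model.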
Presented by: Andra Waagmeester (Micelio)
Wikidata is Wikimedia’s knowledge base and a sister project of Wikipedia. In the OCLC Research 2018 International Linked Data Survey for Implementers, it ranked fifth among the data sources consumed by linked data projects and services. Wikidata runs on an infrastructure called Wikibase (http://wikiba.se/). On the fifth anniversary of Wikidata, the community released a Docker image that allows individual Wikibase instances to be launched and hosted. The release of this Docker image sparked a series of workshops (https://blog.factgrid.de/archives/835) that led to various improvements, making it more straightforward to start your own linked database with workflows, an API and features similar to those of Wikidata.
In this tutorial, you will learn how to set up a Wikibase instance and how to start populating that instance, either manually or through software bots.
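As a rough idea of what a software bot sends when it populates an instance, the sketch below assembles the request parameters for creating a new item through the Wikibase `wbeditentity` API action. It only builds the parameters and does not contact any server. The function name and the example values are hypothetical, and a real bot would first log in and obtain a CSRF token from the instance's API before posting the request.

```python
import json


def new_item_params(label, description, csrf_token, language="en"):
    """Build the POST parameters for creating an item with one label and description."""
    data = {
        "labels": {language: {"language": language, "value": label}},
        "descriptions": {language: {"language": language, "value": description}},
    }
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(data),  # the entity content is passed as a JSON string
        "token": csrf_token,
        "format": "json",
    }


# Placeholder token; a real one is issued by the Wikibase instance after login.
params = new_item_params("aspirin", "example compound entry", "PLACEHOLDER-TOKEN")
```

These parameters would be sent with an HTTP POST to the `api.php` endpoint of your own Wikibase instance; existing bot frameworks wrap the login, token handling and request steps.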