Tutorials – 7th December

Tutorial Programme

Mon 7th Dec	Elton Room	Bowring Room
08:30	Registration opens
09:00-10:30	Tutorial on PubChemRDF in the life sciences	Tawny-OWL: re-purposing software engineering for ontology building
10:30-11:00	Break – refreshments in Garden Room
11:00-12:30	Tutorial on using Wikidata in the life sciences	Cont. Tawny-OWL: re-purposing software engineering for ontology building
12:30-13:30	Lunch – Garden Room
13:30-15:30	Semantic Representations of Clinical Care and Clinical Trial Data	BioSolr: Building better search for bioinformatics
15:30-16:00	Break – refreshments in Garden Room
16:00-17:00	Cont. Semantic Representations of Clinical Care and Clinical Trial Data	Processing Life Science Data at Scale – using Semantic Web Technologies
17:00-17:30	Break
17:30-18:30	DisGeNET: a discovery platform for translational bioinformatics

Tutorial details

1. BioSolr: Building better search for bioinformatics
2. Semantic Representations of Clinical Care and Clinical Trial Data
3. Tutorial on using Wikidata in the life sciences
4. Tutorial on PubChemRDF in the life sciences
5. Tawny-OWL: re-purposing software engineering for ontology building
6. Processing Life Science Data at Scale – using Semantic Web Technologies
7. DisGeNET: a discovery platform for translational bioinformatics

1. BioSolr: Building better search for bioinformatics

Presenters: Tony Burdett¹, Matt Pearce², Tom Winch², Charlie Hull², Helen Parkinson¹ and Sameer Velankar¹

¹European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
²Flax, St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, United Kingdom

Data retrieval is common in bioinformatics databases; however, optimal strategies for indexing rich biomedical data are less well understood. Biomedical data often contain hierarchical components, such as annotations to ontologies, and therefore do not conform to the flattened document-based model imposed by most search technologies. BioSolr advances the state of the art in indexing and querying biomedical data with open source software. This unique BBSRC-funded collaboration between Flax, an open source search specialist company based in Cambridge, and the European Bioinformatics Institute (EMBL-EBI) brings together experts in biological data management and experts in the world-leading Apache Lucene/Solr search engine framework to address the challenges of making biomedical data more accessible. Challenges include integrating ontology-enabled search and searching by common classification systems (taxonomy, enzyme classifications, protein families, etc.). BioSolr is developing software to facilitate indexing of ontologies, ontology-driven faceting and other search techniques that use the additional semantics provided in datasets that have been enriched with ontology annotations.

In this tutorial you will learn how to:

Install Solr
Index some example data with embedded ontology annotations
Use an example web application to search this data
Install BioSolr plugins to ontology-enrich your Solr index
Configure BioSolr
Perform ontology-powered searches
Further enhance your searches with dynamic, ontology-enabled faceting
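
As a rough illustration of the kind of faceted search request the steps above lead to, the following minimal sketch builds a standard Solr select URL with facet parameters. It uses only ordinary Solr query parameters (q, rows, wt, facet, facet.field) and does not represent the BioSolr plugins themselves. The core name "documents" and the field names "efo_uri" and "efo_label" are assumptions made for the example, not names taken from the tutorial.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, core, text, facet_fields, rows=10):
    """Return a Solr /select URL that searches for `text` and facets on the given fields."""
    params = [
        ("q", text),        # the user's search text
        ("rows", str(rows)),
        ("wt", "json"),     # ask Solr for a JSON response
        ("facet", "true"),  # switch faceting on
    ]
    # One facet.field parameter per field we want counts for.
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr",
    "documents",
    "lung cancer",
    ["efo_uri", "efo_label"],
)
print(url)
```

The URL can be pasted into a browser or fetched with any HTTP client once a local Solr instance with a matching core is running.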

Requirements
A Java 7 installation. We will be installing Solr 5 as part of the tutorial.

2. Semantic Representations of Clinical Care and Clinical Trial Data

Presenters: Eric Prud’hommeaux¹, Harold Solbrig²

¹W3C/MIT staff contact for the Semantic Web in Health Care and Life Sciences Interest Group
²Mayo Clinic, Phoenix, AZ, USA

The semantic infrastructure for clinical data has quietly arrived. HL7’s FHIR is dominating all venues for clinical data interoperability and HL7 is committed to an RDF format for FHIR. Earlier expressions of HL7’s Reference Information Model in OWL effectively define the RDF representation for C-CDA, the clinical data exchange format mandated by the US government. Joint industry and academic projects in Europe also express clinical data in RDF for drug research and translational medicine. Patient-driven services like Consent to Research drive patient advocacy groups to demand ways to get prompt, complete patient records. From SMART Platforms’ commoditized smart-phone apps to revolutions in clinical trial data repositories and, hopefully, clinical trial submissions, semantic representation of clinical data can improve its availability and utility for research, decision support, adverse event detection, outcomes measurement, outbreak surveillance, etc.

This tutorial will show how to use conventional tools like SPARQL and OWL to interpret, reason about, and extend clinical care and clinical trial data. This will include downloadable personal health care data, such as that mandated by the US’s “Blue Button”, and genomic data, which can affect diagnostic decisions and medication efficacy. This technology will enable the translational medicine systems of tomorrow.

Content overview

(Lecture) Clinical information and terminology models. This will describe the way clinical data is modeled and expressed, touching on the division of labor between the information and terminology models. We will describe how systems like FHIR, C-CDA and BRIDG capture clinical observations and patient metadata.
(Hands-on) FHIR profiles (schema all the way down) and how they give some hope for interoperable clinical data. We will describe how profiles can be mapped to knowledge representation systems (OWL) and validation schemas (ShEx) for practical enforcement of conformance.
(Study) Mapping between information model representations. Participants will compare side-by-side conventional XML clinical data and the RDF graph conveying the same information.
(Hands-on) Use XSLT to transform conventional clinical data to RDF. After turning the XSLT crank, we will perform sanity checks and validation on the generated data.
(Hands-on) Use OWL and SPARQL CONSTRUCT to map between BRIDG, O-RIM and FHIR.
(Hands-on) Exploration of semantic clinical data. Participants will develop and execute SPARQL queries for:
- Evaluation of treatment efficacy
- Post-market drug surveillance.

Requirements

High comfort with SPARQL
Low comfort with Protégé
Familiarity with, if not expertise in, OWL.
Those not skilled in OWL will have to take some of the subsumption logic on faith, to be explained if there’s time and popular demand.
NO health care expertise is required – this is intended to help people acquire such expertise and contribute productively to semantic health care projects.

3. Tutorial on using Wikidata in the life sciences

Presenters: Sebastian Burgstaller-Muehlbacher¹, Elvira Mitraka³, Andra Waagmeester¹,², Magnus Manske⁴, Benjamin Good¹, Andrew I. Su¹

(1) Department of Molecular and Experimental Medicine, The Scripps Research Institute, USA (2) Micelio, Belgium, (3) University of Maryland School of Medicine, USA, (4) Wellcome Trust Sanger Institute, Cambridge, UK

This tutorial will start with a general introduction to the inner workings of Wikidata and its relationship to Wikipedia. We will begin by explaining the structure of a Wikidata item, which describes a concept and its properties through labels, synonyms, claims and statements on that specific concept.
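
To make the item structure described above more concrete, here is a minimal sketch that reads the label, synonyms (aliases) and statement values out of a small, hand-written record shaped like Wikidata's entity JSON. The record is heavily simplified, and the identifiers "Q_EXAMPLE" and "P_EXAMPLE" and all values are placeholders invented for the example, not real Wikidata content.

```python
import json

# Hand-written, simplified sample in the shape of Wikidata's entity JSON.
# All identifiers and values are placeholders, not real Wikidata data.
SAMPLE_ITEM = json.loads("""
{
  "id": "Q_EXAMPLE",
  "labels": {"en": {"language": "en", "value": "example gene"}},
  "aliases": {"en": [{"language": "en", "value": "EXG1"}]},
  "claims": {
    "P_EXAMPLE": [
      {"mainsnak": {"snaktype": "value", "property": "P_EXAMPLE",
                    "datavalue": {"type": "string", "value": "12345"}},
       "rank": "normal"}
    ]
  }
}
""")


def describe_item(item, lang="en"):
    """Collect the label, aliases and statement values of one item for one language."""
    label = item.get("labels", {}).get(lang, {}).get("value")
    aliases = [alias["value"] for alias in item.get("aliases", {}).get(lang, [])]
    statements = {}
    for prop, claims in item.get("claims", {}).items():
        values = []
        for claim in claims:
            snak = claim["mainsnak"]
            # Only snaks of type "value" carry a datavalue.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
        statements[prop] = values
    return {"id": item["id"], "label": label, "aliases": aliases, "statements": statements}


summary = describe_item(SAMPLE_ITEM)
print(summary)
```

Real items carry many more fields (descriptions, qualifiers, references, sitelinks); the sketch only covers the parts named in the paragraph above.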

The tutorial will then continue with a presentation on the Protein Box Bot, a Python client that we have developed which manages information pertinent to Genes, Proteins, Diseases and Drugs on Wikidata. Next we will demonstrate how Wikidata content can be used to populate infobox content in Wikipedia pages in different languages. Finally, we will demonstrate how to access and visualize Wikidata content in other applications such as R or through JavaScript frameworks such as D3.js.

We will also allow time for some hands-on training, including manually adding content to Wikidata.

Audience
The tutorial is aimed at an audience of:

Data owners in the life sciences interested in opening their content for integration with other resources to enhance knowledge
Life scientists interested in using linked data from Wikidata in data analysis and in applications
People interested in learning how to edit data in Wikidata

Funding: This work was supported by the National Institutes of Health under grants GM089820, GM083924, GM114833 and DA036134.

4. Tutorial on PubChemRDF in the life sciences

Presenters: Gang Fu, Evan Bolton

National Center for Biotechnology Information, National Institutes of Health

The tutorial will illustrate the semantic annotations of PubChem databases using existing ontologies and the cross-references to other RDF-based biomedical resources. In addition to the RDF data modeling, we will also demonstrate how to search PubChem and interconnected databases using Semantic Web technologies, i.e. SPARQL queries and logic-based inference, to address complicated questions on behalf of biomedical research.

We will first give a general overview of the PubChem databases and the PubChemRDF project. In particular, we will go through the RDF data modeling in detail and explain how the links were generated. We will also demonstrate how to programmatically access and download the RDF triples, as well as how to deploy them in an RDF store. The PubChemRDF RESTful interface allows users to formulate simple SPARQL-like queries to group and filter associated entities. A couple of interesting sample queries will be presented. Once users are familiar with the PubChemRDF data contents and data availability, the tutorial will continue with a couple of SPARQL queries that address complicated biomedical questions. We will also demonstrate how an ontological framework with logic-based inference can help querying.

Audience
The tutorial is aimed at an audience of:

Semantic Web data scientists who build the links and applications upon the RDF-based life science data
Research scientists interested in using linked data in PubChemRDF for data analysis and complicated queries
Data contributors across chemical and biological domains who are interested in publishing their data for integration with other resources to enhance knowledge-based discovery

5. Tawny-OWL: re-purposing software engineering for ontology building

Presenters: Phillip Lord, Jennifer D. Warrender

Newcastle University, Newcastle, UK

Tawny-OWL addresses the need for a programmatic tool for the construction of ontologies. As ontologies increase in size and complexity, programmatic access enables developers to modify and extend them rapidly, to abstract, to enforce consistency in patterns and to make changes maintainable. Tawny-OWL combines an attractive and simple syntax with the power of a rich programming language. It enables us to exploit a rich ecosystem of programming tools and apply them to ontology development. Rich IDEs and all the tools of the programmer come for free. It is R to Protégé’s Excel. It is built on the OWL API, however, so it also integrates with the existing ecosystem of ontology-specific tools. Participants will finish with a clear understanding of ontology construction using Tawny-OWL, how to exploit its programmatic facilities and environment, and the current and future applications for it.

Aims
At the end of the tutorial, participants will:

understand the motivation behind Tawny-OWL
understand and use basic Clojure infrastructure
have built a sample ontology, using Tawny-OWL
understand pattern usage within Tawny-OWL
have implemented a pattern, using Tawny-OWL
understand the relationship to programmatic IDEs and related tools

Content overview

Motivation, and Introduction to Tawny-OWL
Creating classes. Automating disjoint statements.
Creating object properties. Automating inverse statements.
Creating classes in bulk and from lists. Automating covering axioms.
Creating defined classes and reasoning in Tawny-OWL
Patternised ontology development. Manipulating classes in bulk.
Multiple mechanisms for interacting with ontologies.
Talk on advanced features: the Karyotype Ontology

Requirements
NEED:

a working knowledge of OWL and ontologies
basic knowledge of amino acids

DO NOT NEED:

to have any experience of Tawny-OWL
to have knowledge of the OWL API
to have any experience of Clojure
to be highly experienced programmers, but WOULD BENEFIT from some programming experience
to have any familiarity with an IDE/Editor with Clojure support, but WOULD BENEFIT if they followed instructions to run a Clojure “hello world” program in their IDE of choice

6. Processing Life Science Data at Scale – using Semantic Web Technologies

Presenters: Ali Hasnain, Naoise Dunne, Dietrich Rebholz-Schuhmann

Insight Centre for Data Analytics, National University of Ireland, Galway

The life sciences domain has been one of the early adopters of linked data, and a considerable portion of the Linked Open Data cloud is composed of datasets from Life Sciences Linked Open Data (LSLOD). The deluge of biomedical data in the last few years, partially caused by the advent of high-throughput gene sequencing technologies, has been a primary motivation for these efforts. This success has led to growth in the size of datasets and to the need to integrate several of them. This growth requires large-scale distributed infrastructure and specific techniques for managing large linked data graphs. In combination with Semantic Web and Linked Data technologies, these promise to enable the processing of large and semantically heterogeneous data sources and the capture of new knowledge from them. In this tutorial we present the state of the art in large data processing, as well as its amalgamation with Linked Data and Semantic Web technologies for better knowledge discovery and targeted applications. We aim to provide useful information for the Knowledge Acquisition research community as well as the working Data Scientist.

Aims
Our learning objectives are the following:

Provide basic knowledge regarding the fundamentals of Large Scale Data in Life Sciences and related technologies.
Elaborate how semantic web technologies are useful for managing Large Scale Data.
Elaborate how to access and benefit from semantic data on the Web.
Elaborate how to make use of Large Scale Data and introduce some of the

Content Overview
Our tutorial will consist of the following sessions:

Classical Query Federation: In this session we will discuss the concepts around SPARQL query federation for accessing multiple heterogeneous biological datasets to draw meaningful biological correlations. Real life sciences datasets, e.g. DrugBank and DailyMed, will be queried to illustrate SPARQL federation.
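
As a small illustration of what a federated query looks like, the sketch below assembles a SPARQL 1.1 query that uses one SERVICE clause per remote endpoint and joins the results on a shared variable. The endpoint URLs and the ex: predicates are hypothetical placeholders, not the actual DrugBank or DailyMed endpoints or vocabularies.

```python
def federated_query(select_vars, prefixes, service_patterns, limit=10):
    """Build a SPARQL 1.1 query with one SERVICE block per (endpoint, pattern) pair."""
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    blocks = [
        f"  SERVICE <{endpoint}> {{\n    {pattern}\n  }}"
        for endpoint, pattern in service_patterns
    ]
    variables = " ".join(f"?{var}" for var in select_vars)
    lines = prefix_lines + [f"SELECT {variables} WHERE {{"] + blocks + ["}", f"LIMIT {limit}"]
    return "\n".join(lines)


# Hypothetical endpoints and predicates, for illustration only.
query = federated_query(
    select_vars=["drug", "name", "label"],
    prefixes={"ex": "http://example.org/vocab/"},
    service_patterns=[
        ("http://example.org/drugs/sparql", "?drug ex:genericName ?name ."),
        ("http://example.org/labels/sparql", "?label ex:describesDrug ?drug ."),
    ],
)
print(query)
```

Both patterns share the variable ?drug, so the query engine joins the bindings returned by the two endpoints on that variable.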

Scalable Infrastructure: In this session we will discuss the concerns, available tools and current trends in the creation of the distributed infrastructures for processing of large Linked Datasets at scale.
We will begin with the considerations involved in building and running these infrastructures, highlighting the rationale for using containers, the need for schedulers to manage both jobs and resources, and the importance of managing failure in Large Scale Data architectures. We will then explore the popular existing schedulers Hadoop, YARN and Mesos and the use of Docker containers. We will then focus on the specific concerns of Linked Data pipelines and the trade-offs versus other architectures, including the evolution from Batch to Real Time to Lambda Architectures and the emergence of Reactive Pipelines. Finally, we will review our own Big Linked Data Knowledge pipeline and how we met these concerns.

Graph Aggregation: In this session we will discuss how to generate useful aggregations of the distributed graph to better understand the structure of the underlying data. Such aggregations help us to interact with the graph and to understand large linked datasets where some of the data schema is missing or mixed with other schemas. Finally, we will show how these techniques can be used to extend the SPARQL language with aggregation operations, for example to analyse biological networks.
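
One simple way to picture such an aggregation is a class-level summary: each instance-level triple is replaced by the classes of its subject and object, and identical summary edges are counted. The sketch below does this over a tiny in-memory list of triples. It is an assumption-laden toy rather than the presenters' pipeline, and the ex: names and placeholder class labels are invented for the example.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def summarize(triples):
    """Count (subject class, predicate, object class) edges over a list of triples."""
    # Map each typed resource to its class; if a resource has several types,
    # this simple sketch keeps only the last one seen.
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    summary = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # type triples define the classes, they are not summarised
        subject_class = types.get(s, "UNTYPED")
        object_class = types.get(o, "LITERAL_OR_UNTYPED")
        summary[(subject_class, p, object_class)] += 1
    return summary


# Invented sample data, for illustration only.
triples = [
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:ibuprofen", "rdf:type", "ex:Drug"),
    ("ex:ptgs2", "rdf:type", "ex:Protein"),
    ("ex:aspirin", "ex:targets", "ex:ptgs2"),
    ("ex:ibuprofen", "ex:targets", "ex:ptgs2"),
    ("ex:aspirin", "ex:label", "Aspirin"),
]
summary = summarize(triples)
for edge, count in sorted(summary.items()):
    print(edge, count)
```

The resulting counts describe the shape of the data even when no explicit schema is available, which is the situation the session addresses.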

Visualisation: In this session we will demo different visualization tools and applications built to visualize Big RDF Data. Applications include ReVeaLD (a Real-time Visual Explorer and Aggregator of Linked Data) [8], Genome Wheel, GenomeSnip (Fragmenting the Genomic Wheel to augment discovery in cancer research) and FedViz (A Visual Interface for SPARQL Queries Formulation and Execution).

Hands On Session: In this session we will give the audience some practical exposure to the tools and technologies discussed in earlier sessions. Using a sample RDF dataset, we will ask the audience to explore it with a visualisation tool and create a SPARQL query. We will guide the audience in creating a graph summary of this RDF dataset, and show how this can help group results in a meaningful way.

Requirements
Basic knowledge of SPARQL and Linked Data, a text editor and an HTML5-enabled web browser.

7. DisGeNET: a discovery platform for translational bioinformatics

Presenters: Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, and Laura I. Furlong

Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spain

In this tutorial we will present the DisGeNET Discovery Platform for the dynamic exploration and analysis of human diseases and their genes. The platform consists of a comprehensive knowledge base of over 400,000 gene-disease associations arising from both expert-curated databases and information extracted from the scientific literature using text mining, with special attention paid to the explicit provenance of each association. The DisGeNET platform provides a set of analysis tools designed to facilitate and foster the study of the molecular underpinnings of human diseases. The tutorial will illustrate the semantic knowledge representation of DisGeNET, and how to explore and analyze the data using different tools, with special emphasis on harnessing the data using Semantic Web technologies.

Content Overview

The tutorial will be divided into two parts:

A) We will offer a lecture on DisGeNET to show several unique features that make it a very useful platform for biomedical researchers such as mappings to different biomedical vocabularies, or the DisGeNET score that rates the confidence of each gene-disease association based on the supporting evidence. We will look through the RDF data modeling and the cross-references to other RDF-based biomedical resources to explain how to query DisGeNET and interconnected databases using Semantic Web technologies to answer complex biomedical questions.

B) We will do a hands-on session to illustrate how to use the suite of analysis tools of the knowledge base: 1) a Web interface that supports user-friendly data exploration and downloading, together with a demonstration of how to access the data programmatically; 2) the DisGeNET Cytoscape app for network analysis of DisGeNET data; 3) the SPARQL endpoint and Faceted Browser to show how the information contained in DisGeNET can be explored, queried and expanded to a variety of external RDF resources already present in the Linked Open Data cloud using Semantic Web technologies, and how to perform ontology-walking queries using disease ontologies.

Requirements

Installation of Cytoscape (version 3.x)
Basic knowledge of SPARQL and Linked Data

Audience
The tutorial is aimed at a variety of audiences: bioinformaticians and software developers who interrogate the database programmatically or using Semantic Web technologies, systems biology experts who explore and analyse network representations of the information, and biologists or healthcare practitioners who interrogate the database using its user-friendly Web interface.

Funding
The research leading to these results has received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (PI13/00082 and CP10/00524), the Innovative Medicines Initiative Joint Undertaking under grant agreements n° 115002 (eTOX) and n° 115191 (Open PHACTS), resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme 2014-2020 under Grant Agreement No 634143. Laura I. Furlong received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (CP10/00524). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).