SWAT4LS Conference in Basel, 13-16 Feb 2023



Tutorial descriptions


Introduction to SPARQL and RDF

An introduction to SPARQL and RDF and how they can be used to answer different questions. Aimed at the absolute beginner who has never seen a SPARQL query or any other formal query language, the tutorial shows how to navigate from the basics into productivity. We will use Jupyter notebooks as a starting point, but participants do not need to know Python either.
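A first query of the kind this tutorial starts from can be sketched in a notebook cell as below (the helper function is our own; the query is generic and works against any SPARQL endpoint):

```python
# A first SPARQL query, built as a plain Python string so it can be
# inspected before being sent to any endpoint.

def build_first_query(limit: int = 10) -> str:
    """Return a minimal SPARQL SELECT fetching any `limit` triples."""
    return f"""
SELECT ?subject ?predicate ?object
WHERE {{
  ?subject ?predicate ?object .
}}
LIMIT {limit}
""".strip()

print(build_first_query(10))

# In a Jupyter notebook one would then send this string to a public
# SPARQL endpoint, e.g. with the SPARQLWrapper package (omitted here
# to keep the sketch offline).
```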

Participation: in person (max 30 participants)
Presented by Jerven Bolleman

Rhea RDF and SPARQL: how to use

Rhea is a database of biologically relevant reactions that are balanced for mass and charge, with links to the biochemical literature. Rhea is freely available as RDF and via SPARQL under the CC-BY 4.0 international license. It is used by resources such as UniProt to annotate the enzymatic activity of proteins. In this tutorial we will show how to access Rhea, how it uses and extends ChEBI, and how it links to UniProt. We will also show how other ELIXIR resources can be used to explore the chemical space further, including chemical substructure and similarity search.
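A query of the kind covered in this tutorial might look like the sketch below, built as a Python string so it can be inspected offline. The rh: property names follow the Rhea RDF vocabulary as we understand it and should be verified against the documentation at the Rhea SPARQL endpoint:

```python
# Sketch: list Rhea reactions together with their chemical equations.
# Property names (rh:equation) follow the published Rhea RDF vocabulary;
# check https://sparql.rhea-db.org/sparql for the current schema.

def rhea_equations_query(limit: int = 10) -> str:
    """SPARQL listing Rhea reactions and their equations."""
    return f"""
PREFIX rh: <http://rdf.rhea-db.org/>
SELECT ?reaction ?equation
WHERE {{
  ?reaction rh:equation ?equation .
}}
LIMIT {limit}
""".strip()

print(rhea_equations_query())
# Send this to the Rhea SPARQL endpoint (e.g. from a Jupyter notebook)
# to retrieve live results.
```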

Participation: in person (max 30 participants)
Presented by Jerven Bolleman, Dr. Anne Morgat

UniProt RDF + SPARQL how to use

UniProt is probably the largest SPARQL endpoint freely available in the life sciences. However, the data, while central and well maintained, is overwhelming. This tutorial will show how to use the SPARQL endpoint with Jupyter notebooks to answer interesting biochemical questions. We will focus on the data model of UniProt (what is where) and on how other ontologies and databases such as the Gene Ontology, ChEBI and Rhea are reused, each time taking a question that a researcher may ask and building different solutions to answer it. In terms of data content, we will focus on enzymes and their activities, and on proteins known to be involved in human diseases.
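A query in the spirit of this tutorial is sketched below: reviewed human proteins carrying a disease annotation. The class and property names follow the UniProt core vocabulary (up:) as we understand it; verify them against the example queries at the UniProt SPARQL endpoint:

```python
# Sketch: reviewed human (taxon 9606) proteins with disease annotations.
# Vocabulary terms (up:Protein, up:Disease_Annotation, ...) follow the
# UniProt core ontology; check https://sparql.uniprot.org for examples.

def human_disease_proteins_query(limit: int = 10) -> str:
    return f"""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT ?protein ?disease
WHERE {{
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           up:reviewed true ;
           up:annotation ?annotation .
  ?annotation a up:Disease_Annotation ;
              up:disease ?disease .
}}
LIMIT {limit}
""".strip()

print(human_disease_proteins_query())
```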

Participation: in person (max 30 participants)
Presented by Jerven Bolleman, Dr. Anne Morgat

Knowledge Graphs, Property Graphs and HyperGraphs

Knowledge Graphs are coming of age, even though there is still no single, clear definition for them. However, the universe of how they can be effectively applied is still emerging. What was once a way of representing and sharing detailed data from multiple domains (using RDF) is now expanding into how machine platforms can process and analyze complex sets of data towards inference and intelligent solution creation. Knowledge graphs bridge the organization of complex facts and relations with their use to infer new knowledge and discover novel solutions that are explanatory.

Property graphs augment the original linked-data triple structure by associating properties with entities and relations, effectively embedding shortcuts. These are provided as a technological efficiency for storage and query. They are subsets of yet greater structures commonly referred to as HyperGraphs: graphs that include higher relations (HR) beyond just binary edges r(x,y), such that HR: r(x1, x2, …, xn), where n ≥ 2. The relation arity is assumed to be ordered, effectively implying a rich relational semantic. For example, an IL-6 antagonist was used to treat a 54-year-old with COVID-related cytokine storm: this is (at least) a three-way statement relating patient, disease and treatment that requires all three binary relations to hold conjunctively.

HyperGraphs are often based on a formal mathematical structure called a simplex, which forms a precise embedding hierarchy of sub-simplices, namely nodes, edges, faces, etc. Each simplex of type t contains all sub-simplices of type s < t below it (one can understand s as ranging over the subsets of t), so a three-way “face” relation not only contains 3 nodes but also has 3 binary edges connecting all 2-combinations of those nodes. By using NamedGraphs, these (vertical) simplex hierarchies can be precisely linked semantically for inferencing. They can also each be linked “horizontally” into what are called complexes, thereby relating simple or composite entities in a multitude of meaningful ways. In other words, reasoning runs both up and down the hierarchy as well as laterally across links. This dichotomy has advantages for speeding up relation processing.
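The containment claim above can be checked mechanically; a minimal sketch, reusing the patient/disease/treatment relation from the previous paragraph as the three-node face:

```python
from itertools import combinations

# A 2-simplex ("face") on three nodes.  Per the simplex definition in the
# text, it contains every lower-order sub-simplex: the 3 nodes
# (0-simplices) and the 3 binary edges (1-simplices) connecting all
# 2-combinations of those nodes.
face = ("patient", "disease", "treatment")

nodes = [frozenset([n]) for n in face]
edges = [frozenset(e) for e in combinations(face, 2)]

print(len(nodes), len(edges))  # 3 nodes, 3 edges
```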

Why even consider a Knowledge Hypergraph? Every edge-connected knowledge graph on n vertices is one of k = 2^(n(n-1)) potential graphs projecting onto one super hypergraph complex of n vertices. The uncertainty about which graph is coded correctly, or better, implies that edge graphs may often miss critical relations. HyperGraphs, though possibly carrying more relations than necessary, can hold all the critical structures, even if not fully labeled.
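The count k can be computed directly: each of the n(n-1) ordered vertex pairs either carries a directed edge or it does not.

```python
def potential_graphs(n: int) -> int:
    """Number of directed graphs (no self-loops, no multi-edges) on n
    labelled vertices: 2 choices for each of the n*(n-1) ordered pairs."""
    return 2 ** (n * (n - 1))

print(potential_graphs(3))  # 64
print(potential_graphs(4))  # 4096
```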

This tutorial will discuss the relatedness between each of the different kinds of graphs, and how one may transform effectively into another. Relevance to machine learning will also be partially addressed.

Discussion topics will include:

  • Examples of Knowledge Graphs (biomedicine) that can be projected intelligently into HyperGraphs
    • Transitioning from r(x1, x2) to r(x1, …, xn)
    • Capturing (partial) results as indexed edges and hyperedges for cached queries
  • Limitations of property graphs that can be handled by HyperGraphs
  • Utilizing current graph stores for HyperGraphs by leveraging NamedGraphs
  • Deep Learning applications (e.g. attention-leveraging transformers and stable diffusion) that work with graph structures including HyperGraphs
  • Optimizing discovery: From edge traversals to completion of Knowledge HyperGraphs as a means to provide multi-hops and complex inferencing
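One of the listed topics, leveraging NamedGraphs in current graph stores to hold hyperedges, can be sketched in miniature (all IRIs are our own, purely illustrative):

```python
from collections import defaultdict

# The three-way patient/disease/treatment statement from the text,
# encoded as three binary triples that share one named graph.  The graph
# name stands for the hyperedge, so the conjunction can be recovered by
# grouping quads on it.
HYPEREDGE = "urn:ex:hyperedge1"

quads = [
    ("urn:ex:patient54",     "urn:ex:hasDisease",  "urn:ex:cytokineStorm", HYPEREDGE),
    ("urn:ex:patient54",     "urn:ex:treatedWith", "urn:ex:IL6antagonist", HYPEREDGE),
    ("urn:ex:cytokineStorm", "urn:ex:treatedBy",   "urn:ex:IL6antagonist", HYPEREDGE),
]

# Group by graph name to reassemble the full hyperedge.
by_graph = defaultdict(list)
for s, p, o, g in quads:
    by_graph[g].append((s, p, o))

print(len(by_graph[HYPEREDGE]))  # 3 binary relations, held conjunctively
```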

Participation: in person (max 50 participants)
Presented by Eric Neumann

Machine learning with biomedical ontologies

Ontologies are increasingly being used to provide background knowledge in machine learning models. We provide an introduction to different methods that use ontologies in machine learning models. We will start the tutorial by introducing semantic similarity measures that rely on axioms in ontologies to compare domain entities. From semantic similarity, we will develop and discuss unsupervised machine learning methods that can “embed” ontologies in vector spaces to allow comparison of domain entities based on similarity in these spaces. We will introduce mOWL, a software library for machine learning with ontologies, based on which the methods we discuss can be implemented. mOWL is freely available at https://github.com/bio-ontology-research-group/mowl. Throughout the tutorial, we will use biomedical examples for hands-on tasks. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources.
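As a flavor of the semantic-similarity portion, here is an illustrative (deliberately mOWL-independent) ancestor-overlap measure over a toy subclass hierarchy; all class names are invented for the example:

```python
# Toy ontology: each class maps to its direct superclasses.
toy_ontology = {
    "cardiomyopathy": {"heart disease"},
    "arrhythmia": {"heart disease"},
    "heart disease": {"disease"},
    "disease": set(),
}

def ancestors(term: str) -> set:
    """All superclasses of `term`, including `term` itself."""
    seen = {term}
    frontier = [term]
    while frontier:
        for parent in toy_ontology.get(frontier.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

def jaccard_similarity(a: str, b: str) -> float:
    """Semantic similarity as overlap of ancestor sets."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)

print(jaccard_similarity("cardiomyopathy", "arrhythmia"))  # 0.5
```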

Participation: in person
Presented by Sarah Alghamdi, Robert Hoehndorf, Maxat Kulmanov, Sumyyah Toonsi, Fernando Zhapa-Camacho

Mapping clinical data to rule models (MaCD4Rules)

Clinical data is expressed in many hundreds of medical electronic record formats. To encourage interoperability, knowledge portability, and the development of a cross-platform health application ecosystem, many vendors, standard development organizations, and user groups concentrate on developing exchange formats like C-CDA and, more recently, FHIR. Yet, even with such standards, interoperability remains elusive. Some have proposed the use of well-defined clinical logical models and corresponding APIs to help address this challenge.

Electronic medical record (EMR) formats are typically designed around highly regularized generic structures which are far from the mental models that clinicians use in their daily practice. Offering clinicians intuitive models allows them to participate in the production, approval, and long-term evaluation of clinical decision support rules. For instance, asking a clinician to write rules in terms of blood pressures containing both systolic and diastolic components will generally be much more productive than asking them to write rules about a FHIR resource with a resourceType of “Observation” and two components, one with a code of type CodeableConcept containing some Coding concept references for one of many deployed codes for “Systolic blood pressure”… To further complicate matters, resources may vary across versions of FHIR and may be modeled differently within a given version of FHIR. Thus, to support greater portability in health care, any given messaging model ought to be normalized into a rules model, a form that can be generally consumed and produced by health applications, thus reducing the current degrees of freedom.
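The blood-pressure normalization described above can be sketched as a small hand-written translation (an illustration, not ShExMap itself; LOINC codes 8480-6 and 8462-4 are the commonly deployed systolic/diastolic codes, but real data may use others):

```python
# A generic FHIR "Observation" resource, flattened into the intuitive
# {systolic, diastolic} model a rule author would expect.
fhir_observation = {
    "resourceType": "Observation",
    "component": [
        {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
         "valueQuantity": {"value": 120, "unit": "mm[Hg]"}},
        {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
         "valueQuantity": {"value": 80, "unit": "mm[Hg]"}},
    ],
}

CODE_TO_FIELD = {"8480-6": "systolic", "8462-4": "diastolic"}

def to_blood_pressure(obs: dict) -> dict:
    """Flatten an Observation's components into a clinical model."""
    bp = {}
    for component in obs.get("component", []):
        for coding in component["code"]["coding"]:
            field = CODE_TO_FIELD.get(coding["code"])
            if field:
                bp[field] = component["valueQuantity"]["value"]
    return bp

print(to_blood_pressure(fhir_observation))  # {'systolic': 120, 'diastolic': 80}
```

A rule such as "alert if systolic > 140" can then be written against this flat model rather than against the nested FHIR structure.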

ShExMap may support such normalization by converting local data models into a more representational form that can be readily consumed by knowledge applications. ShExMaps were used in a pilot to migrate patient data from the Department of Defense to the Veterans Administration. In another effort, in order to enable rules written over a natural object-oriented class hierarchy, clinical data had to be migrated to that model. In a recent NIH grant (5-R01-EB030529-02), we annotated the FHIR ShEx schema for Observation with ShExMap semantic actions mapped to a rules schema. This enabled automatic translation of instance data to many different clinical models used by the rules.

This tutorial will introduce the problems involved in creating clinical models and translating local data into these forms, a step that is essential when developing and executing the CDS rules that run over them. Presenters will describe the landscape of existing models and provide hands-on training in the creation, debugging and execution of ShExMaps for translating from FHIR to sensible clinical models.

Clinical decision support offers great returns in reducing clinician workloads or at least providing backup when overloaded clinicians fail to remember something or take it into account in daily care. We’re trying to help by creating intuitive tools for expressing and executing these rules.

Participation: in person (max 25 participants). Hybrid might be possible.
Presented by Claude Nanjo MA MPH, Eric Prud’hommeaux

Creating Knowledge Graph Subsets

Knowledge graphs like Wikidata are a successful use case for Semantic Web technologies, as they offer information from different domains that can be accessed through a SPARQL endpoint. However, their success, which has made their size grow, can also hinder their practical application for researchers who cannot easily process or download the portion of the data about their specific domain in order to analyze it or combine it with other datasets. In this tutorial we will present techniques and tools that allow researchers to define subsets of Wikidata using Shape Expressions and to create them. This also allows researchers to store specific subsets for future reference, which facilitates reproducibility. The tutorial is divided into three parts: in the first part we present Wikidata and its data model, as well as the Entity Schemas namespace based on ShEx that can be used to describe Wikidata subsets; in the second part we present the tools that can be used to create subsets from those Shape Expression schemas; and in the third part we describe use cases and applications of the generated subsets.
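The subsetting idea can be illustrated in a few lines (a hand-rolled sketch, not the ShEx-based tooling the tutorial covers; all identifiers are invented except P31, Wikidata's "instance of" property):

```python
# Toy Wikidata-style statements as (item, property, value) triples.
triples = [
    ("Q1", "P31", "Qdisease"),
    ("Q1", "Plabel", "toy disease A"),
    ("Q2", "P31", "Qcity"),
    ("Q2", "Plabel", "toy city B"),
]

def subset_by_class(triples, wanted_class):
    """Keep every triple whose subject is an instance of `wanted_class`.
    A ShEx schema generalizes this to full shape conformance."""
    members = {s for s, p, o in triples if p == "P31" and o == wanted_class}
    return [t for t in triples if t[0] in members]

print(subset_by_class(triples, "Qdisease"))
```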

Participation: in person (max 25 participants). Hybrid might be possible.
Presented by Jose Emilio Labra Gayo, Andra Waagmeester

Swiss Personalized Health Network in Action: How to design an SPHN-compliant RDF Schema

Following the FAIR principles (Findable, Accessible, Interoperable and Reusable), the SPHN Data Coordination Center (DCC) is developing a decentralized infrastructure to enable collaborative research by making the meaning of health-related data understandable to both humans and machines. The SPHN Interoperability Framework is based on a strong semantic layer of information (SPHN Dataset) and a graph technology for the exchange and storage of data. In this training we will demonstrate how you can capture your semantics in a simple way and use the SPHN tooling to represent it formally, in our case as an RDF schema. Further, we will create a human-readable visualization, rules for data validation (SHACL) and queries (SPARQL) for performing data quality checks.

The training is split into three parts:

1. Capturing Semantics: Typically, projects first try to capture the semantics of their data in a common human-readable format. This ensures that the barrier to entry is minimal for both technical and non-technical users. The format of choice is typically an Excel spreadsheet. This allows domain experts and subject-matter experts to concentrate on the semantics of the model and ensures that the focus is on data modeling and on defining concepts and properties. In SPHN this representation is called the SPHN Dataset; in the course you will learn how to use the SPHN Dataset and extend it with the concepts you need.

2. Representing Semantics: Once the semantics are well defined, the next step is to represent them using a formal language. The most common data model of choice is the Resource Description Framework (RDF). Adopting RDF as a form of knowledge representation allows the use of RDF Schema (RDFS) for representing classes and properties. In SPHN this step is facilitated by the Dataset2RDF tool, which generates the RDF Schema automatically from an Excel spreadsheet, without requiring in-depth knowledge of RDF from the user. With the SPHN tool stack, a human-readable HTML document is generated together with the formal RDF Schema, which can be shared within the consortia.

3. Data Validation: Ensuring data quality is of utmost importance for high-quality research. To facilitate quality control and compliance with SPHN data specifications at the data-provider level, we will establish a set of Shapes Constraint Language (SHACL) rules that can be run on the data graph before the data are shared with researchers. The SHACLer and SPARQLer tools automatically generate SHACL shapes and SPARQL queries from the project’s data schema. This also facilitates the review of new project specifications and ensures harmonized data delivery from all sites.
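The three steps above can be sketched end-to-end in miniature (our own prefixes and string templates, not the actual Dataset2RDF or SHACLer output):

```python
# One spreadsheet row defining a concept and one of its properties,
# turned into (a) an RDFS class/property snippet and (b) a SHACL shape
# requiring that property.  The ex: prefix and templates are invented.
row = {"concept": "BloodPressure", "property": "hasSystolicValue",
       "datatype": "xsd:decimal"}

def row_to_rdfs(row: dict) -> str:
    """Step 2 in miniature: a formal RDFS rendering of the row."""
    return (f"ex:{row['concept']} a rdfs:Class .\n"
            f"ex:{row['property']} a rdf:Property ;\n"
            f"    rdfs:domain ex:{row['concept']} ;\n"
            f"    rdfs:range {row['datatype']} .")

def row_to_shacl(row: dict) -> str:
    """Step 3 in miniature: a SHACL shape for validating instances."""
    return (f"ex:{row['concept']}Shape a sh:NodeShape ;\n"
            f"    sh:targetClass ex:{row['concept']} ;\n"
            f"    sh:property [ sh:path ex:{row['property']} ;\n"
            f"                  sh:datatype {row['datatype']} ;\n"
            f"                  sh:minCount 1 ] .")

print(row_to_rdfs(row))
print(row_to_shacl(row))
```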

Participation: in person (max. 25 participants)
Presented by Sabine Österle, Kristin Gnodtke, Deepak Unni, Philip Krauss, Jascha Buchhorn

Semantic Knowledge Modeling for Domain Experts in Pharma

The FAIR data principles enable companies to make data reusable and human- and machine-actionable. More and more organizations from the Pharma and Life Sciences community are striving to apply these principles internally and to utilize publicly available data to advance R&D and other functions.

One important aspect in this process is modeling domain expert knowledge – including the utilization of public knowledge in the form of ontologies and mapping this knowledge to internal terms, processes, use cases and needs.

The process of knowledge modeling (or semantic modeling) to capture the meaning of data and ensure correct interpretation and usage is a key pillar to FAIR data. This process can only succeed if domain experts / SMEs can perform the modeling task without relying on ontologists or other experienced data modelers.

In this tutorial, we will present an approach that supports this process and enables you, the domain expert, to model your respective domain. We will cover examples from R&D research and target discovery, including the use of public ontologies like Mondo or NCI, through clinical trial scoping and consent modeling, to manufacturing, supply chain management and market access.

The session will include a mix of theoretical learning, best practices and hands-on modeling exercises to familiarize you with the concepts and give you a chance to apply this knowledge.

Participation: in person (max 30 participants)
Presented by Irina Schmidt, Sebastian Schmidt