Welcome to the 12th SWAT4(HC)LS hackathon and the 1st BioHackathon – SWAT4HCLS. We are teaming up with the BioHackathon, and it is a great honour that we can now call the SWAT4HCLS hackathon the BioHackathon – SWAT4HCLS.
There has always been an overlap in audience with the other two BioHackathons.
Date
The BioHackathon SWAT4HCLS 2023 will take place on the 4th day of the event and will be organised as an unconference.
Venue
- Seminar room U1.191
- Seminar room U1.197
- Seminar room U1.193
Schedule
- 07:00 Online channel
- Hack, collaborate, discuss, etc.
- Hack, collaborate, discuss, etc.
- 09:00 Welcome and Pitches
- Hack, collaborate, discuss, etc.
- Hack, collaborate, discuss, etc.
- 12:00 Intermediate updates
- Hack, collaborate, discuss, etc.
- Hack, collaborate, discuss, etc.
- 16:30 Reports
Proposals
We are looking forward to your hack topics. Please use this form to pitch any idea.
Received Hackathon proposals
- Creating life-science subsets from Wikidata
Wikidata is the linked-data repository of the Wikimedia Foundation. Since its launch in 2012 it has grown into a substantial, general-purpose knowledge graph. One of the issues with Wikidata, apart from it often being too big to handle, is that it is a moving target: it is constantly in flux, which can be a concern for reproducibility when Wikidata is used. Wikidata does, however, provide regular dumps, and methods already exist to extract subsets from these dumps. Such subsets yield more manageable topical knowledge graphs, which could enable more complex querying than is currently possible on Wikidata itself. During the hackathon, we would like to describe the boundaries of life-science subsets and subsequently extract those subsets from Wikidata.
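As a starting point for scoping such a subset, here is a minimal sketch of probing one candidate boundary against the public Wikidata Query Service; the choice of human genes (P31 "instance of" Q7187 "gene", P703 "found in taxon" Q15978631 "Homo sapiens") is purely illustrative, not a proposed subset definition.

```python
# Minimal sketch: count the items inside one candidate subset boundary
# (human genes) via the public Wikidata Query Service. The entity and
# property IDs are illustrative choices, not a fixed subset definition.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT (COUNT(?gene) AS ?count) WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
}
"""

def run_query(query: str) -> dict:
    # The query service returns SPARQL JSON results when asked via Accept.
    response = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "swat4hcls-subset-sketch/0.1 (hackathon example)",
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for row in run_query(QUERY)["results"]["bindings"]:
        print("human genes in Wikidata:", row["count"]["value"])
```

The same boundary description, once agreed on, could then drive a dump-based extraction rather than the live endpoint.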
- Export and share an RDF representation of the Image Data Resource (IDR)
The Image Data Resource (IDR) is home to 13 million multi-dimensional image datasets. Each of these is annotated with (a subset of) Gene, Phenotype, Organism/Cell Line, Antibody, siRNA, and Chemical Compound metadata. Initial work has been performed to export this information as RDF from the data management system (OMERO), where it is stored in PostgreSQL tables, using https://pypi.org/project/omero-rdf. The export of the largest single study (defined as a collection of the image datasets associated with a single publication), however, generates 100M triples. This study, representing images of tissue from the Human Protein Atlas, has been exported directly using SQL and parallelized scripts.
At this hackathon, we would like to:
* Review the RDF structure and URIs to prepare them for production (see the sketch after this list)
* Identify strategies for subsetting the RDF for various use cases
* Draft endpoints (SPARQL, Bioschemas) for the consumption of the subsets and test their scalability
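To ground that review, the following rdflib sketch shows one shape the exported annotations could take; the base URI, class, and predicates are placeholders for discussion, not the actual omero-rdf output format.

```python
# Illustrative sketch of one possible RDF shape for IDR image annotations.
# The base URI, class and predicates are placeholders for discussion; they
# are not the actual output of omero-rdf.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

IDR = Namespace("https://idr.openmicroscopy.org/rdf/")  # hypothetical base URI

g = Graph()
g.bind("idr", IDR)

image = IDR["image/12345"]  # placeholder image identifier
g.add((image, RDF.type, IDR.Image))
g.add((image, RDFS.label, Literal("HPA tissue image (example)")))
# Placeholder predicates mirroring the metadata categories listed above:
g.add((image, IDR.gene, Literal("ENSG00000141510")))
g.add((image, IDR.organism, Literal("Homo sapiens")))
g.add((image, IDR.antibody, Literal("HPA000001")))

print(g.serialize(format="turtle"))
```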
- Exploring the connectivity between observational data and molecular networks
The objective of this hackathon is to explore to what degree it is possible to build a Knowledge Graph spanning observational and molecular data. From a standardised representation of a patient record (e.g. in OMOP) we can derive observations (diagnoses, phenotypes, biomarkers). At the molecular level, we have a range of resources, such as protein-protein interaction networks or pathways, that can explain how some molecular alterations are causally connected. Linking these two levels, we have gene-to-disease associations, but also potentially phenotype databases and more. To what degree is it possible to formulate a holistic network such that, given a range of patient-level observations, it would be possible to trace them to the alteration of connected molecular mechanisms? To what degree would it be possible to characterise a patient in terms of the overall pathways/networks affected?
The objectives of this hackathon are:
1. To assess what data resources exist, and which of them are publicly available.
2. To assess to what extent such data resources can be connected via common ontologies.
3. To assess the coverage and sparsity of such connections.
4. To assess how informative an overall graph would be, given the high level of abstraction (e.g. many statements on molecular networks are de-contextualised with respect to tissue or individual characteristics).
This is an explorative hackathon. The outcome is a better understanding of what is available, and of the gaps, potential, and limitations in building a KG spanning population-to-bench data.
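As a toy illustration of the traversal in question, the sketch below builds a miniature graph from a patient-level observation down to a pathway; every identifier and edge is invented for illustration only.

```python
# Toy multi-level graph: patient observation -> disease -> gene -> pathway.
# All identifiers and associations below are invented for illustration.
import networkx as nx

kg = nx.DiGraph()

# A patient-level observation (e.g. an OMOP/SNOMED-coded diagnosis) ...
kg.add_edge("patient:001", "condition:SNOMED:73211009", relation="has_diagnosis")
# ... mapped to a disease concept via a common ontology ...
kg.add_edge("condition:SNOMED:73211009", "disease:MONDO:0005015", relation="maps_to")
# ... linked to genes through gene-disease associations ...
kg.add_edge("disease:MONDO:0005015", "gene:HGNC:6081", relation="associated_with")
# ... and finally to molecular networks/pathways.
kg.add_edge("gene:HGNC:6081", "pathway:WP:WP4206", relation="participates_in")

# Trace the patient-level observation down to affected pathways.
for path in nx.all_simple_paths(kg, "patient:001", "pathway:WP:WP4206"):
    print(" -> ".join(path))
```

Assessing how densely such edges exist in real resources, and how well their ontologies align, is exactly what objectives 1-3 above are about.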
- Hacking Scholia for custom SPARQL endpoints
This hack project follows up on efforts of others to repurpose the Scholia platform for other SPARQL endpoints, such as custom Wikibase installations or Wikidata subsets. In this project we will explore how we can make the Scholia code base configurable with respect to which aspects and panels to show.
If the project is successful, we will be able to set up a custom Scholia instance around any existing public SPARQL endpoint (e.g. ChEMBL, UniProt/NextProt, etc.).
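A minimal Flask sketch of the configurability idea is shown below; this is not Scholia's actual code base, just an illustration of reading the SPARQL endpoint and the panels to render from configuration.

```python
# Minimal sketch: the endpoint URL and panel list come from configuration
# instead of being hard-coded. Not Scholia's actual code base.
from flask import Flask, jsonify
import requests

app = Flask(__name__)
app.config.update(
    SPARQL_ENDPOINT="https://query.wikidata.org/sparql",  # swappable per instance
    PANELS=["publications", "coauthors", "topics"],       # which panels to show
)

@app.route("/aspect/<item>")
def aspect(item: str):
    # A trivial per-item query; real aspect queries would be panel-specific.
    query = f"SELECT ?p ?o WHERE {{ wd:{item} ?p ?o }} LIMIT 10"
    response = requests.get(
        app.config["SPARQL_ENDPOINT"],
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    return jsonify(
        panels=app.config["PANELS"],
        results=response.json()["results"]["bindings"],
    )

if __name__ == "__main__":
    app.run(debug=True)
```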
Expertise needed
- SPARQL coding, and/or
- Python coding (Flask experience particularly welcome)
- (Bio)Schemas markup, specifications and consumption (with special focus on Machine Learning metadata)
In this SWAT4HCLS BioHackathon project we will focus on metadata to describe machine learning, particularly to support the Data, Optimization, Model and Evaluation (DOME) recommendations, within the scope of the NFDI4DataScience consortium. To this end, we will look at existing ML portals and (Bio)schemas specifications to find the matching points, i.e. creating crosswalks. At the same time, we will offer support to anyone who wants to add Bioschemas markup to their own resources or who wants to consume resources already marked up with Bioschemas.
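As a sketch of what such a crosswalk could look like on the markup side, the JSON-LD below describes a trained model with schema.org terms plus hypothetical "dome:" properties; agreeing on real terms for these placeholders is precisely what the project would work towards.

```python
# Sketch of schema.org-flavoured JSON-LD for a trained ML model. The
# "dome:*" keys live in a placeholder namespace; no agreed DOME crosswalk
# vocabulary is assumed to exist yet.
import json

model_metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "dome": "https://example.org/dome#",  # placeholder namespace
    },
    "@type": "SoftwareSourceCode",  # no dedicated ML-model type assumed
    "name": "Example protein-function classifier",
    "dome:data": {"trainingDataset": "https://example.org/datasets/train-v1"},
    "dome:optimization": {"algorithm": "random forest"},
    "dome:model": {"availability": "https://example.org/models/rf-v1"},
    "dome:evaluation": {"metric": "F1", "value": 0.87},
}

print(json.dumps(model_metadata, indent=2))
```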
Expertise needed
- People familiar with or interested in Bioschemas/schema.org
- People familiar with or interested in metadata schemas
- People familiar with Machine Learning training processes
- Speed up the development of biomedical use cases with Linked Life Data inventory
Explore a FAIR catalogue of linked life-data sets and identify relevant sources of linked data to implement various use cases in the biomedical domain. Query dataset metadata to identify relevant sources for a particular category of information and the volumes of data involved. Identify linked data sets and the types of semantic relations available. Develop specific use cases such as the following (a metadata-query sketch follows the list):
1. identify key opinion leaders in clinical research for specific therapeutic areas and geographic regions;
2. mine for gene-gene variant-disease associations in the context of specific biological processes;
3. mine drug targets for a specific class of molecular substances.
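As a sketch of the catalogue-querying step, the query below uses the (real) VoID vocabulary and Dublin Core terms against a placeholder endpoint; the actual catalogue and its metadata vocabulary would be settled during the hackathon.

```python
# Sketch: query a linked-data catalogue for dataset metadata via VoID terms.
# The endpoint URL is a hypothetical placeholder.
import requests

CATALOGUE_ENDPOINT = "https://example.org/lld-catalogue/sparql"  # placeholder

QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?dataset ?title ?triples WHERE {
  ?dataset a void:Dataset ;
           dcterms:title ?title ;
           void:triples ?triples .
  FILTER (CONTAINS(LCASE(?title), "drug"))   # e.g. sources for drug targets
}
ORDER BY DESC(?triples)
"""

response = requests.get(
    CATALOGUE_ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["title"]["value"], row["triples"]["value"])
```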