Tutorial 2: Describing Datasets with the Health Care and Life Sciences Community Profile

Michel Dumontier 1, Alasdair J. G. Gray 2, and M. Scott Marshall 3

1 Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
2 Computer Science, Heriot-Watt University, Edinburgh, UK
3 Department of Radiation Oncology, Netherlands Cancer Institute, Amsterdam, The Netherlands

Abstract
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. To provide a practical guide for producing high-quality descriptions of biomedical datasets, the W3C Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The profile reuses existing vocabularies and is intended to meet key functional requirements, including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets. The goal of this tutorial is to explain the elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.

Keywords: Dataset description, Metadata, FAIR Data Principles, Data profile

Motivation
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data are made available, researchers are finding it increasingly difficult to discover and reuse these data. The W3C Health Care and Life Sciences (HCLS) Interest Group has developed a community profile that defines the properties required to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Reusable). Many of the notions and vocabulary terms are drawn from established vocabularies: basic metadata is provided using Dublin Core, DCAT and VoID, with provenance and versioning information provided by PROV-O and PAV. In total, the specification reuses 18 existing vocabularies, and the resulting dataset descriptions are expressed in a machine-readable format using RDF. The specification consists of 61 metadata elements pertaining to data description, identification, licensing, attribution, conformance, versioning, provenance, and content summary.

The purpose of this tutorial is to guide participants through the three-tier model of the HCLS community profile. It will explain which metadata properties are required and which are optional at each level, and the values that can be provided. It will also give an overview of the validation service that is available for checking that a dataset description conforms to the community profile.

Tutorial Material
The tutorial will consist of a mix of slide presentations and hands-on creation of a sample dataset description. All material will be made available under a Creative Commons license, and will be publicly accessible from the W3C HCLS Interest Group pages.

The first part of the tutorial will provide an overview of the HCLS Community Profile for Dataset Descriptions. The slide presentation will introduce the audience to the three-tier model and the metadata elements for each tier. The ChEMBL dataset will be used as an example. The attendees will then be given the chance to create their own dataset description. In the fashion of BYOD events, we encourage participants to bring their own dataset. An example will be provided for those who do not.
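As a rough illustration of what a description spanning three linked levels can look like, the following Python sketch builds a handful of triples and prints them as N-Triples. It is not taken from the tutorial material: the example.org IRIs, the titles, and the choice of properties (dct:title, pav:hasCurrentVersion, dct:isVersionOf, pav:version, dcat:distribution, dct:format) are assumptions chosen for illustration, and the profile itself should be consulted for the authoritative list of elements at each level.

```python
# Minimal sketch of a three-level dataset description, serialised as N-Triples.
# All example.org IRIs and literal values below are hypothetical.

DCT = "http://purl.org/dc/terms/"
PAV = "http://purl.org/pav/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTYPES_DATASET = "http://purl.org/dc/dcmitype/Dataset"

BASE = "http://example.org/dataset/"  # hypothetical namespace


def iri(value):
    """Format an IRI as an N-Triples term."""
    return f"<{value}>"


def lit(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


summary = iri(BASE + "demo")              # version-independent description
version = iri(BASE + "demo/v1")           # one specific release
distribution = iri(BASE + "demo/v1/rdf")  # one file format of that release

triples = [
    (summary, iri(RDF_TYPE), iri(DCTYPES_DATASET)),
    (summary, iri(DCT + "title"), lit("Demo dataset")),
    (summary, iri(PAV + "hasCurrentVersion"), version),
    (version, iri(RDF_TYPE), iri(DCTYPES_DATASET)),
    (version, iri(DCT + "isVersionOf"), summary),
    (version, iri(PAV + "version"), lit("1")),
    (version, iri(DCAT + "distribution"), distribution),
    (distribution, iri(RDF_TYPE), iri(DCAT + "Distribution")),
    (distribution, iri(DCT + "format"), lit("text/turtle")),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

The links between levels are what carry the structure: the summary points to its current version, the version points back to the summary and forward to each distribution.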

The second part of the tutorial will look at the validation of dataset descriptions against the community profile specification. After a brief overview of the Validata tool, the attendees will be encouraged to validate the descriptions they have created during the tutorial against the W3C-hosted service (https://www.w3.org/2015/03/ShExValidata/, accessed September 2016).
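The idea behind such validation can be sketched in a few lines of Python: for a given level, check which required properties are missing from a description. This is only a toy illustration and not how Validata works internally. The REQUIRED table below is a hypothetical, deliberately incomplete subset invented for the example; the authoritative requirements are those in the community profile.

```python
# Toy conformance check: report required properties missing at a given level.
# REQUIRED is an illustrative, hypothetical subset, not the profile's real lists.

REQUIRED = {
    "summary": {"dct:title", "dct:description", "dct:publisher"},
    "version": {"dct:title", "dct:description", "dct:publisher",
                "pav:version", "dct:isVersionOf"},
    "distribution": {"dct:title", "dct:description", "dct:publisher",
                     "dct:format"},
}


def missing_properties(level, description):
    """Return the sorted required properties absent from a description.

    `description` maps prefixed property names to their values.
    """
    if level not in REQUIRED:
        raise ValueError(f"unknown level: {level!r}")
    return sorted(REQUIRED[level] - set(description))


# Hypothetical version-level description that lacks a link to its summary.
draft = {
    "dct:title": "Demo dataset, release 1",
    "dct:description": "First release of the demo dataset.",
    "dct:publisher": "http://example.org/org/demo-lab",
    "pav:version": "1",
}

print(missing_properties("version", draft))
```

A real validator also checks value types, cardinalities, and properties that must not appear at a given level, which this sketch omits.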

The final part of the tutorial will provide an overview of the metadata statistics that can be provided for an RDF dataset. The slide presentation will explain the motivation behind providing rich dataset statistics and how they can be exploited. There will then be an opportunity for a hands-on session to run some of the SPARQL queries needed to generate the statistics.
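To give a feel for the kind of statistics involved, the sketch below computes a few simple counts over a small in-memory set of triples in plain Python. It is an assumption-laden stand-in for the SPARQL queries used in the hands-on session: the sample data and ex: identifiers are hypothetical, and the result keys borrow VoID terms (void:triples, void:distinctSubjects, void:properties, void:distinctObjects, void:classes) only as labels.

```python
# Sketch of basic dataset statistics over an in-memory list of triples.
# The sample triples are hypothetical; in practice the equivalent counts would
# come from SPARQL queries against the dataset, e.g.
#   SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }

RDF_TYPE = "rdf:type"


def dataset_statistics(triples):
    """Count triples, distinct subjects, properties, objects, and classes."""
    unique = set(triples)  # an RDF graph is a set, so duplicates count once
    return {
        "void:triples": len(unique),
        "void:distinctSubjects": len({s for s, _, _ in unique}),
        "void:properties": len({p for _, p, _ in unique}),
        "void:distinctObjects": len({o for _, _, o in unique}),
        "void:classes": len({o for _, p, o in unique if p == RDF_TYPE}),
    }


sample = [
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:aspirin", "rdfs:label", "Aspirin"),
    ("ex:ibuprofen", "rdf:type", "ex:Drug"),
    ("ex:ibuprofen", "rdfs:label", "Ibuprofen"),
    ("ex:aspirin", "rdf:type", "ex:Drug"),  # duplicate statement
]

stats = dataset_statistics(sample)
for name, value in stats.items():
    print(name, value)
```

Counts like these let a consumer judge the size and shape of a dataset before downloading or querying it.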

Audience
This tutorial will primarily be of interest to data publishers, from knowledge base maintainers to researchers publishing experimental data. This includes a growing number of researchers, driven by funder requirements to enable data reuse and by the community push to publish data according to the FAIR data principles.

The tutorial will also be of interest to those looking to discover and reuse existing datasets. These consumers of data will gain an understanding of the properties that can enable them to identify datasets of interest and get started with them more quickly.

Participants will be expected to have a working knowledge of RDF and the Turtle serialisation. A basic understanding of SPARQL will be beneficial for understanding the queries used to generate the dataset statistics. Prior knowledge of metadata vocabularies is an advantage but not essential.

Presenters
Michel Dumontier, Stanford Center for Biomedical Informatics Research
Michel is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research focuses on developing computational methods for biomedical knowledge discovery. Michel is the co-chair of the W3C Semantic Web for Health Care and Life Sciences Interest Group and a co-editor of the HCLS Community Profile for Dataset Descriptions. Michel leads the open source Bio2RDF project to produce linked data for the life sciences and is involved in a number of national and international initiatives to increase the discoverability, accessibility, interoperability and reusability of data and software.

Alasdair Gray, Department of Computer Science, Heriot-Watt University, UK
Alasdair is one of the editors of the HCLS Community Profile for Dataset Descriptions. He also led the development of the Open PHACTS Dataset Description standard and is an author of the PAV vocabulary. Alasdair is an Assistant Professor in Computer Science and teaches courses on Database Management Systems and Big Data Management. He has given academic tutorials at SWAT4LS, ESWC, and ELIXIR BYOD events.