

Feature Article

Science Forum Wikidata as a knowledge graph for the life sciences

compared the disease data in Wikidata to the most current DO release on a monthly basis. In our first comparison between Wikidata and the official DO release, we found that Wikidata users added a total of 2030 new cross references to GARD (Lewis et al., 2017) and MeSH (https://www.nlm.nih.gov/mesh/meshhome.html). These cross references were primarily added by a handful of users through a web interface focused on identifier mapping (Manske, 2020). Each cross reference was manually reviewed by DO expert curators, and 2007 of these mappings (98.9%) were deemed correct and therefore added to the ensuing DO release. Of the proposed mappings, 771 could not be easily validated using simple string matching, and 754 of these (97.8%) were ultimately accepted into DO. Each subsequent monthly report included a smaller number of added cross references to GARD and MeSH, as well as ORDO (Maiella et al., 2018) and OMIM (Amberger and Hamosh, 2017; McKusick, 2007), and these entries were incorporated after expert review at a high approval rate (>90%). Addition of identifier mappings represents the most common community contribution, and likely the most accessible crowdsourcing task. However, Wikidata users also suggested numerous refinements to the ontology structure, including changes to the subclass relationships and the addition of new disease terms. These structural changes were more nuanced and were therefore rarely incorporated into DO releases without modification. Nevertheless, they often prompted further review and refinement by DO curators in specific subsections of the ontology. The Wikidata crowdsourcing curation model is generalizable to any other external resource that is automatically synced to Wikidata. The code to detect changes and assemble reports is tracked online at https://github.com/SuLab/scheduled-bots (archived at Stupp et al., 2020) and can easily be adapted to other domain areas.
This approach offers a novel solution for integrating new knowledge into a biomedical ontology through distributed crowdsourcing while preserving control over the expert curation process. Incorporation into Wikidata also enhances exposure and visibility of the resource by engaging a broader community of users, curators, tools, and services.
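The monthly comparison described above can be illustrated with a minimal sketch. This is not the code in the scheduled-bots repository; the function names, the data layout (a mapping from DO identifier to a set of cross references), and the toy identifiers are all assumptions made for illustration. Only the counts passed to `approval_rate` come from the text.

```python
from typing import Dict, Set, Tuple

# Assumed layout: DO identifier -> set of "PREFIX:ID" cross references.
Xrefs = Dict[str, Set[str]]


def new_xrefs(wikidata: Xrefs, do_release: Xrefs) -> Set[Tuple[str, str]]:
    """Return (DO ID, xref) pairs present in Wikidata but absent from the DO release."""
    added = set()
    for doid, xrefs in wikidata.items():
        for xref in xrefs - do_release.get(doid, set()):
            added.add((doid, xref))
    return added


def approval_rate(accepted: int, proposed: int) -> float:
    """Percentage of proposed mappings accepted by curators, to one decimal place."""
    if proposed == 0:
        raise ValueError("no proposed mappings")
    return round(100.0 * accepted / proposed, 1)


# Toy data with hypothetical identifiers, for illustration only.
do_release = {"DOID:0001": {"MESH:D000001"}, "DOID:0002": set()}
wikidata = {
    "DOID:0001": {"MESH:D000001", "GARD:1234"},
    "DOID:0002": {"MESH:D000002"},
}

# Candidate mappings that would go into a report for curator review.
report = sorted(new_xrefs(wikidata, do_release))
print(report)

# Counts reported in the text for the first comparison.
print(approval_rate(2007, 2030))  # all proposed cross references
print(approval_rate(754, 771))    # those not verifiable by simple string matching
```

Keeping the diff step separate from the review step mirrors the workflow in the text: the comparison only proposes candidates, and curators decide what enters the next release.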

Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614

Interactive pathway pages

In addition to its use as a repository for data, we explored the use of Wikidata as a primary access and visualization endpoint for pathway data. We used Scholia, a web app for displaying scholarly profiles for a variety of Wikidata entries, including individual researchers, research topics, chemicals, and proteins (Nielsen et al., 2017). Scholia provides a more user-friendly view of Wikidata content with context and interactivity that is tailored to the entity type. We contributed a Scholia profile template specifically for biological pathways (Scholia, 2019). In addition to essential items such as title and description, these pathway pages include an interactive view of the pathway diagram collectively drawn by contributing authors. The WikiPathways identifier property in Wikidata informs the Scholia template to source a pathway-viewer widget from Toolforge (https://tools.wmflabs.org/admin/tool/pathway-viewer) that in turn retrieves the corresponding interactive pathway image. Embedded into the Scholia pathway page, the widget provides pan and zoom, plus links to gene, protein and chemical Scholia pages for every clickable molecule on the pathway diagram (see, for example, Scholia, 2019). Each pathway page also includes information about the pathway authors. The Scholia template also generates a participants table that shows the genes, proteins, metabolites, and chemical compounds that play a role in the pathway, as well as citation information in both tabular and chart formats. With Scholia template views of Wikidata, we were able to generate interactive pathway pages with comparable content and functionality to that of dedicated pathway databases. Wikidata provides a powerful interface to access these biological pathway data in the context of other biomedical knowledge, and Scholia templates provide rich, dynamic views of Wikidata that are relatively simple to develop and maintain.
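As a rough illustration of how a participants table could be retrieved from Wikidata, the sketch below builds a SPARQL query keyed on a WikiPathways identifier. It is not the query that the Scholia template actually uses. The property numbers (P2410 for the WikiPathways ID, P527 for "has part"), the helper names, and the placeholder identifier `WP0000` are assumptions and should be checked against Wikidata before use.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def participants_query(wikipathways_id: str) -> str:
    """Build a SPARQL query listing the parts of the pathway with this WikiPathways ID."""
    if not wikipathways_id.isalnum():
        raise ValueError("unexpected characters in WikiPathways ID")
    return f"""
SELECT ?participant ?participantLabel WHERE {{
  ?pathway wdt:P2410 "{wikipathways_id}" ;   # assumed: WikiPathways ID property
           wdt:P527 ?participant .           # assumed: "has part" links to participants
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def query_url(wikipathways_id: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": participants_query(wikipathways_id), "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


# "WP0000" is a placeholder, not a real pathway identifier.
print(participants_query("WP0000"))
```

The sketch only constructs the query and URL; sending the request and rendering the results as a table are left out so that the example runs without network access.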

Phenotype-based disease diagnosis

Phenomizer is a web application that suggests clinical diagnoses based on an array of patient phenotypes (Köhler et al., 2009). On the back end, the latest version of Phenomizer uses BOQA, an algorithm that uses ontological structure in a Bayesian network (Bauer et al., 2012). For phenotype-based disease diagnosis, BOQA takes as input a list of phenotypes (using the Human Phenotype Ontology [HPO; Köhler et al., 2017]) and an association file between phenotypes and diseases. BOQA then suggests disease diagnoses based on semantic similarity (Köhler et al., 2009). Here, we studied whether phenotype-disease associations from
