
Feature Article

Science Forum Wikidata as a knowledge graph for the life sciences

across them for each query. These approaches lower the barriers to adding new data by enabling anyone to publish data by following community standards. However, performance is often an issue when each query must be sent to many individual databases, and the performance of the system as a whole is highly dependent on the stability and performance of each individual component. In addition, data integration requires harmonizing the differences in the data models and data formats between resources, a process that can often require significant skill and effort. Moreover, harmonizing differences in data licensing can sometimes be impossible.

Here we explore the use of Wikidata (www.wikidata.org; Vrandečić, 2012; Mora-Cantallops et al., 2019) as a platform for knowledge integration in the life sciences. Wikidata is an openly-accessible knowledge base that is editable by anyone. Like its sister project Wikipedia, the scope of Wikidata is nearly boundless, with items on topics as diverse as books, actors, historical events, and galaxies. Unlike Wikipedia, Wikidata focuses on representing knowledge in a structured format instead of primarily free text. As of September 2019, Wikidata’s knowledge graph included over 750 million statements on 61 million items (tools.wmflabs.org/wikidata-todo/stats.php). Wikidata was also the first project run by the Wikimedia Foundation (which also runs Wikipedia) to have surpassed one billion edits, achieved by a community of 12,000 active users, including 100 active computational ‘bots’ (Figure 1—figure supplement 1).

As a knowledge integration platform, Wikidata combines several of the key strengths of the centralized and distributed approaches. A large portion of the Wikidata knowledge graph is based on the automated imports of large structured databases via Wikidata bots, thereby breaking down the walls of existing data silos. Since Wikidata is also based on a community-editing model, it harnesses the distributed efforts of a worldwide community of contributors, including both domain experts and bot developers. Anyone is empowered to add new statements, ranging from individual facts to large-scale data imports. Finally, all knowledge in Wikidata is queryable through a SPARQL query interface (query.wikidata.org/), which also enables distributed queries across other Linked Data resources.
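To make the SPARQL interface concrete, the following is a minimal sketch (not taken from the article) of how a query for a few human gene items might be assembled and sent from Python using only the standard library. The endpoint path `/sparql`, the identifiers used (P31 "instance of", Q7187 "gene", P703 "found in taxon", Q15978631 "Homo sapiens", P351 "Entrez Gene ID"), the User-Agent string, and all function names are assumptions made for illustration and should be checked against Wikidata before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the query service mentioned in the text (query.wikidata.org).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_gene_query(limit=5):
    """Return a SPARQL query for a few human gene items with Entrez Gene IDs.

    The property and item identifiers are assumptions for this sketch.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT ?gene ?geneLabel ?entrez WHERE {\n"
        "  ?gene wdt:P31 wd:Q7187 ;\n"          # instance of: gene
        "        wdt:P703 wd:Q15978631 ;\n"     # found in taxon: Homo sapiens
        "        wdt:P351 ?entrez .\n"          # Entrez Gene ID
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request(query):
    """Wrap the query in an HTTP GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "wikidata-sparql-sketch/0.1 (illustrative example)"},
    )


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} rows."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(build_gene_query(5))` would send the request to the live service and return one dictionary per result row, with keys such as `gene`, `geneLabel`, and `entrez`. Keeping query construction, request building, and result parsing in separate functions lets each part be checked without network access.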

Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614

In previous work, we seeded Wikidata with content from public and authoritative sources of structured knowledge on genes and proteins (Burgstaller-Muehlbacher et al., 2016) and chemical compounds (Willighagen et al., 2018). Here, we describe progress on expanding and enriching the biomedical knowledge graph within Wikidata, both by our team and by others in the community (Turki et al., 2019). We also describe several representative biomedical use cases on how Wikidata can enable new analyses and improve the efficiency of research. Finally, we discuss how researchers can contribute to this effort to build a continuously-updated and community-maintained knowledge graph that epitomizes the FAIR principles.

The Wikidata Biomedical Knowledge Graph

The original effort behind this work focused on creating and annotating Wikidata items for human and mouse genes and proteins (Burgstaller-Muehlbacher et al., 2016), and was subsequently expanded to include microbial reference genomes from NCBI RefSeq (Putman et al., 2017). Since then, the Wikidata community (including our team) has significantly expanded the depth and breadth of biological information within Wikidata, resulting in a rich, heterogeneous knowledge graph (Figure 1). Some of the key new data types and resources are described below.

Genes and proteins: Wikidata contains items for over 1.1 million genes and 940 thousand proteins from 201 unique taxa. Annotation data on genes and proteins come from several key databases including NCBI Gene (Agarwala et al., 2018), Ensembl (Zerbino et al., 2018), UniProt (UniProt Consortium, 2019), InterPro (Mitchell et al., 2019), and the Protein Data Bank (Burley et al., 2019). These annotations include information on protein families, gene functions, protein domains, genomic location, and orthologs, as well as links to related compounds, diseases, and variants.

Genetic variants: Annotations on genetic variants are primarily drawn from CIViC (http://www.civicdb.org), an open and community-curated database of cancer variants (Griffith et al., 2017). Variants are annotated with their relevance to disease predisposition, diagnosis, prognosis, and drug efficacy. Wikidata currently contains 1502 items corresponding to human genetic variants,
