Ten Simple Rules for Taking Advantage of Git and GitHub


Introduction

Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.

Box 1

By default, GitHub repositories are freely visible to all. Many projects decide to share their work publicly and openly from the start of the project in order to attract visibility and to benefit from contributions from the community early on. Other groups prefer to work privately on projects until they are ready to share their work. Private repositories ensure that work is hidden but also limit collaboration to those users who are given access to the repository. These repositories can then be made public at a later stage, for example, upon submission, acceptance, or publication of corresponding journal articles. In some cases, when the collaboration was meant to remain private, repositories might never be made publicly accessible.

GitHub relies, at its core, on the well-known and open-source version control system Git, originally designed by Linus Torvalds for the development of the Linux kernel and now developed and maintained by the Git community. One reason for GitHub’s success is that it offers more than a simple source code hosting service [5,6]. It provides developers and researchers with a dynamic and collaborative environment, often referred to as a social coding platform, that supports peer review, commenting, and discussion [7]. Efforts ranging from individual projects to large bioinformatics projects, laboratory repositories, and global collaborations have found GitHub to be a productive place to share code and ideas and to collaborate (see Table 1).

Table 1

Bioinformatics repository examples with good practices of using GitHub. The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/. (doi:10.1371/journal.pcbi.1004947.t001)

| Name of the Repository | Type | URL |
|---|---|---|
| Adam | Community Project, Multiple forks | https://github.com/bigdatagenomics/adam |
| BioPython [8] | Community Project, Multiple contributors | https://github.com/biopython/biopython/graphs/contributors |
| Computational Proteomics Unit | Lab Repository | https://github.com/ComputationalProteomicsUnit |
| Galaxy Project [9] | Community Project, Bioinformatics Repository | https://github.com/galaxyproject/galaxy |
| GitHub Paper | Manuscript, Issue discussion, Community Project | https://github.com/ypriverol/github-paper |
| MSnbase [10] | Individual project repository | https://github.com/lgatto/MSnbase/ |
| OpenMS [11] | Bioinformatics Repository, Issue discussion, branches | https://github.com/OpenMS/OpenMS/issues/1095 |
| PRIDE Inspector Toolsuite [12] | Project Organization, Multiple projects | https://github.com/PRIDE-Toolsuite |
| Retinal wave data repository [13] | Individual project, Manuscript, Binary Data organized | https://github.com/sje30/waverepo |
| SAMtools [14] | Bioinformatics Repository, Project Organization | https://github.com/samtools |
| rOpenSci | Community Project, Issue discussion | https://github.com/ropensci |
| The Global Alliance For Genomics and Health | Community Project | https://github.com/ga4gh |

Some of the recommendations outlined below are broadly applicable to repository hosting services. However, our main aim is to highlight specific GitHub features. We provide a set of recommendations that we believe will help the reader to take full advantage of GitHub’s features for managing and promoting projects in bioinformatics as well as in many other research domains. The recommendations are ordered to reflect a typical development process: learning Git and GitHub basics, collaboration, use of branches and pull requests, labeling and tagging of code snapshots, tracking project bugs and enhancements using issues, and dissemination of the final results.

Rule 1: Use GitHub to Track Your Projects

The backbone of GitHub is the distributed version control system Git. Every change, from fixing a typo to a complete redesign of the software, is tracked and uniquely identified. Although Git offers a large set of commands and supports rather complex operations, learning the basics requires only a handful of new concepts and commands and provides a solid foundation for efficiently tracking code and related content for research projects. Many introductory and detailed tutorials are available (see Table 2 below for a few examples). In particular, we recommend "A Quick Introduction to Version Control with Git and GitHub" by Blischak et al. [5].

Table 2

Online courses, tutorials, and workshops about GitHub and Git for scientists. (doi:10.1371/journal.pcbi.1004947.t002)

| Name of the Material | URL |
|---|---|
| git help and git help -a | Documentation installed with Git |
| Karl Broman’s Git/GitHub Guide | http://kbroman.org/github_tutorial/ |
| Version Control with Git | http://swcarpentry.github.io/git-novice/ |
| Introduction to Git | http://git-scm.com/book/ch1-3.html |
| GitHub Training | https://training.github.com/ |
| GitHub Guides | https://guides.github.com/ |
| Good Resources for Learning Git and GitHub | https://help.github.com/articles/good-resources-for-learning-git-and-github/ |
| Software Carpentry: Version Control with Git | http://swcarpentry.github.io/git-novice/ |

In a nutshell, initializing a (local) repository (often abbreviated as repo) marks a directory as one to be tracked (Fig 1). All or parts of its content can be added explicitly to the list of files to track.
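The two steps just described, initializing a repository and explicitly adding files to track, can be sketched with a small Python wrapper around the Git command line. This is only an illustration under stated assumptions: Git is installed and available on the PATH, and the helper names (`git`, `init_and_stage`) and file names (`analysis.py`, `scratch.txt`) are hypothetical and not taken from the article.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its standard output.

    Assumes the `git` executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def init_and_stage(repo: Path, files_to_track: List[str]) -> List[str]:
    """Initialize `repo` as a Git repository and stage only the named files.

    Returns the paths that Git holds in its index after staging.
    """
    # `git init` creates the .git directory that marks `repo` as a repository.
    git(repo, "init")
    # Files are opted in explicitly; anything not named stays untracked.
    git(repo, "add", "--", *files_to_track)
    # `git ls-files` lists the files currently known to the index.
    return git(repo, "ls-files").splitlines()


if __name__ == "__main__":
    # Hypothetical project content, created in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "analysis.py").write_text("print('hello')\n")
        (repo / "scratch.txt").write_text("temporary notes\n")
        print(init_and_stage(repo, ["analysis.py"]))
        print(git(repo, "status", "--short"))
```

In this sketch only `analysis.py` is added, so `scratch.txt` should remain untracked, which illustrates that initializing a repository does not by itself track every file in the directory.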

Fig 1. The structure of a GitHub-based project illustrating project structure and interactions with the community.