Harmonized Variant Calling for SARS-CoV-2 Genomes
August 6, 2020

To combat the current COVID-19 pandemic, scientists around the world are sequencing viral genomes at an accelerated pace.

These sequences are then being deposited into a number of international databases, including the National Center for Biotechnology Information (NCBI; Figure 1). There are limitations to this approach as multiple databases makes it challenging for a single researcher on their own to consolidate data from different sources. These data were generated and processed by different research groups at different institutions resulting in batch effects when amalgamated -- differences in signal across groups of viral genomes processed together that represent technical noise and not biological variation.  The distributed nature of the data and the lack of uniformity in data generation and processing hinders the pace of scientific discovery. To accelerate discovery, we need to leverage the breadth of data available internationally which requires data consolidation from multiple sources and data “cleaning” to reduce technical artifacts introduced during data processing.

To address this critical gap and accelerate scientific discovery, DNAstack released COVID Cloud, a cloud-based solution that uniquely indexes and integrates data from multiple international sources into a unified data lake. Identifying mutations in the viral genome can help researchers design novel therapeutics and track viral transmissions. COVID Cloud provides easily accessible viral genome data ready for analysis. The data in COVID Cloud can be browsed through apps providing different perspectives including faceted search, point lookup, and 3D visualizations. Users can also export the data into downstream analytical workspaces, such as Jupyter Notebooks, Power BI, or DNAstack’s Workflow Execution Service.


Figure 1

SARS-CoV-2 variant detection

To facilitate high-quality, reproducible variant calling at scale, we have developed and published an open source workflow written in Workflow Descriptive Language (WDL) (available on Dockstore and Github). The workflow has two sub-workflows: one to handle long-read (i.e. Nanopore) and one to handle short-read, paired-end (i.e. Illumina) sequencing data (see Nanopore variant calling and Illumina paired-end variant calling sections for more details). These workflows are designed to take NCBI run accessions as input (used to access raw FASTQ files) and return high-confident variant calls (e.g. VCF files) as well as a consensus genome sequence.

We deployed our WDL variant calling workflows to identify mutations in 10,838 amplicon-based viral sequences hosted on NCBI (10,664 unique samples). Of these, 4,427 are Illumina paired-end short read sequences and 4,471 are Nanopore long read sequences. The resulting variant calls, as well as links to per-sample VCFs and assemblies, have been made freely available for exploration and download at COVID Cloud.

To further promote scalable and reproducible science, we have published both the Illumina and Nanopore WDL workflows. Below is a brief tutorial on how to call variants locally using the DNAstack variant calling workflow.

Running the workflow

The COVID-19 variant calling workflow is available on DNAstack’s GitHub. The workflow may also be viewed on Dockstore. The tools and pipelines used in the workflow have been packaged into publicly available Docker images, allowing the workflow to be run reproducibly across compute environments. Instructions for running the pipeline locally in addition to test input files can be found in the workflow documentation. Briefly, to run the workflow locally, simply edit the input_template.json file found in the inputs directory of the GitHub repository to specify the parameters for the sample of interest, then run it with a workflow runner of your choice, e.g.:

Using Cromwell:

java -jar cromwell.jar run main.wdl -i input_template.json

Using miniwdl:

miniwdl run main.wdl -i input_template.json

The required parameters are the run accession from NCBI (explore SARS-CoV-2 run data); the library type (NANOPORE or ILLUMINA_PE) to determine which pipeline to run; and a file and version number indicating the primer scheme that was used to prepare the library.

Nanopore variant calling

Our Nanopore variant calling workflow leverages the ARTIC bioinformatics protocol (Loman et al., 2020), as implemented by the Connor lab. Briefly, reads are filtered and mapped to the SARS-CoV-2 reference genome (MN908947.3) using minimap2, following which amplicon primer sequences are trimmed. Next, medaka uses neural networks to create a consensus sequence (Oxford Nanopore Technologies, 2018). Medaka is also used to call variants, which are then fed into longshot to produce a set of high-confidence variants. Bcftools is used to generate the final consensus assembly sequence.

For more details please see the ARTIC bioinformatics protocol documentation and the GitHub for the nextflow implementation of this protocol we use in our workflow.

Illumina paired-end variant calling

Our Illumina paired-end variant calling workflow leverages the SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) protocol, produced by the McArthur lab. Briefly, reads are mapped to the human genome (GRCh38) using BWA-MEM to remove host reads (Li and Durbin, 2009). Next, adapters are trimmed and reads are mapped to the reference genome using BWA-MEM. Primer sequences are removed and variants are called using ivar with default parameters (Grubaugh et al., 2018). Ivar is also used to produce a consensus assembly sequence.

For more details please see the SIGNAL GitHub and documentation.

We are immensely grateful to the labs and individuals responsible for creating and open-sourcing the pipelines we have chosen to run this COVID analysis.

Our goals were twofold; we wanted to produce robust, reliable data that could be freely distributed to researchers in the hope that such a large volume of data could help provide novel insight into the virus, and we wanted to provide an easy-to-use pipeline for others aiming to process their own data or to reproduce our analysis on the NCBI data. 

We will continue to iterate and improve upon our analysis pipelines, ingesting more data every day as it becomes available. In the near future, we plan to add an Illumina single-end pipeline, as well as a method to process metagenomic samples. We also plan to identify and ingest from more databases that expose raw sequencing data and metadata; while assembled genomes are valuable, per-site confidence can only be established when raw reads are available. Since our workflows are written in WDL and their environments are containerized, they can be reproducibly run in virtually any compute environment, ensuring accurate results in an infrastructure-independent way.

We believe that making science open, from the raw data, to analysis methods, to sharing results is essential not only to combat fast-moving diseases such as COVID, but also generally to increase the momentum of research across domains. It is our goal to continue to package and share best-practices workflows, analytics, data and results, so that researchers around the world can more rapidly leverage these resources that may otherwise not be available to them.


References and Further Reading