News

December 1, 2020

Consortium Secures $5.1M to Expand Genomics Platform for COVID Research

A national consortium led by DNAstack will expand development of a software platform for genomics and health data and apply it to COVID-19.

The $5.1M project, called COVID Cloud, is co-funded by Canada’s Digital Technology Supercluster and aims to increase Canada’s capacity to harness exponentially growing volumes of genomics and biomedical data to advance precision health. The platform will be used by data scientists and domain experts to help understand, predict, and treat COVID-19 with molecular precision. With a global death count of over 1.4 million people and record numbers of cases nationally, solutions that can help Canada respond to ongoing challenges of the pandemic are urgently needed.

“We are proud to continue to support this consortium’s groundbreaking work through our COVID-19 program,” said Sue Paish, CEO of the Digital Technology Supercluster. “This project shows how Canadian partnerships across multiple organizations and sectors can drive innovation, help us address global health issues, showcase Canadian expertise, and position us well to rebuild and grow our economy.”

The project — a collaboration between BioSymetrics, Centre of Genomics and Policy at McGill University, DNAstack, FACIT, Genome BC, Mannin Research, McMaster University, Microsoft Canada, Ontario Genomics, Ontario Institute for Cancer Research, Roche Canada, Sunnybrook Research Institute, and Vector Institute — brings together Canadian leaders in software engineering, artificial intelligence, cloud computing, genomics, infectious disease, pharmaceuticals, commercialization, and policy. It leverages past work of partners to address needs of infectious disease research with guidance from domain experts.

“Tools that allow us to interrogate SARS-CoV-2 at a molecular level are essential to addressing this global health crisis, both now and in the future,” said Dr. Samira Mubareka, a microbiologist and infectious diseases physician at Sunnybrook, whose team was one of the first in Canada to isolate the novel coronavirus. “The insights we will learn by analysing integrated datasets using technology platforms like COVID Cloud can increase our preparedness for future waves and outbreaks.” Dr. Mubareka will co-chair the project’s translational science efforts along with Dr. Gabriel Musso, Chief Scientific Officer for BioSymetrics. “The infrastructure developed by this initiative will propel collaborative Canadian drug discovery efforts for COVID-19,” said Musso, whose team will lead bioinformatics and computational drug discovery for the project.

A major goal of the project is to make it easy for producers of genomic and health data to share data responsibly over industry standards, and for researchers to harness the collective power of information shared through them. The project deliverables include a suite of software products powered by enterprise-grade implementations of standards developed by Global Alliance for Genomics & Health (GA4GH), protocols that are being designed to facilitate the responsible sharing of genomic and health data, which will help advance precision medicine initiatives around the world.

“The platform is being built on a foundation of open standards that will allow for distributed networks of genomics and biomedical data to be built,” said Dr. Marc Fiume, CEO at DNAstack, whose team will lead software engineering for the project. “We are excited to see these technologies breaking down barriers to data sharing, access, and analysis and create new opportunities for genomics-based discoveries for our partners.”

This project is responding to global demand for highly specialized, scalable, distributed software infrastructure to support collaborative genomics research — a need that has surged since the onset of the COVID-19 pandemic. “COVID-19 has accelerated digital transformation of many industries, especially in healthcare,” said Kevin Peesker, President of Microsoft Canada. “The incredible power of Cloud applied to COVID at scale is expanding development of an information superhighway to securely connect scientists in Canada and around the world to the data and compute power they urgently need to help us overcome one of the greatest global health crises of our time.”

The platform will be used to support a series of projects in partnership with Canadian academic, clinical, and pharmaceutical collaborators, which are being coordinated by Canadian genome centres, Genome British Columbia and Ontario Genomics. These initial projects are being prioritized based on urgency and potential impact on Canada’s response to the COVID-19 pandemic.

“The COVID Cloud is an incredible platform that brings together resources and capacity to enable timely and comprehensive genomic analysis of SARS-CoV-2 for our province and our country,” said Bettina Hamelin, President and CEO of Ontario Genomics, whose team leads the ONCoV Genomics Coalition. “This made-in-Canada solution will immediately accelerate Canada’s response to COVID-19, while being a technological springboard for translating genomic data analysis into actionable medical insights across other disease areas in years to come.”

For more information, visit here.

About DNAstack

DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.

About Digital Technology Supercluster

The Digital Technology Supercluster solves some of industry's and society's biggest problems through Canadian-made technologies. We bring together private and public sector organizations of all sizes to address challenges facing Canada's economic sectors including healthcare, natural resources, manufacturing and transportation. Through this 'collaborative innovation' the Supercluster helps to drive solutions better than any single organization could on its own. The Digital Technology Supercluster is led by industry leaders such as D-Wave, Finger Food Advanced Technology Group, LifeLabs, LlamaZOO, Lululemon, MDA, Microsoft, Mosaic Forest Management, Sanctuary AI, Teck Resources Limited, TELUS, Terramera, and 1Qbit. Together, we work to position Canada as a global hub for digital innovation. A full list of Members can be found here.

About the COVID-19 Program

The COVID-19 Program aims to improve the health and safety of Canadians and support Canada's ability to address issues created by the COVID-19 outbreak. In addition, the program will build expertise and capacity to anticipate and address issues that may arise in future health crises, from healthcare to a return to work and community. More information can be found here.

Press Releases

October 22, 2020

Digital Technology Supercluster Makes $10 Million Investment, Rounding Out $60 Million COVID-19 Program

The Digital Technology Supercluster has made $10.7 million in follow-on investments to five projects under its COVID-19 stream, rounding out the Supercluster’s $60 million budget for the pandemic-focused program.

Read the full article on Betakit.

News

October 17, 2020

Joint Genotyping 10K Whole Genome Sequences Using Sentieon on Google: Strategies for Analyzing Large Sample Sets

Despite advances in sequencing and analysis tools, calling variants in whole-genome sequencing (WGS) data is not trivial, even when dealing with only a few dozen samples.

When the number of samples reaches into the thousands, the time, computational resources, and file storage required for analysis can quickly become overwhelming. This was the challenge faced by the MSSNG team when they sought to joint-call the largest autism cohort yet sequenced — how could they process nearly 10,000 samples in a way that would be quick, reproducible, and allow for future expansion, all without breaking the bank?

The pipeline

One of the key directives of the initiative was to allow for painless future expansion of the dataset — namely, adding new samples without full reprocessing of the entire cohort. In addition, these outputs should be reproducible and consistent across sequencing technologies and analysis tools, so data from multiple experiments across time, labs, and experimental conditions could be combined and jointly analyzed. To that end, MSSNG researchers chose to analyze their WGS data using standards defined by the Centers for Common Disease Genomics (CCDG). The CCDG provides a set of standardized data processing steps for WGS data with a focus on producing functionally equivalent results (Regier et al., 2018). These steps cover the alignment, duplicate marking, and base quality score recalibration (BQSR) tasks that convert the raw FASTQ data to CRAM-format alignment files that may be used for long-term storage and future reanalysis (Figure 1).

Figure 1: Pipeline outline. Paired FASTQ files from each sample are aligned to the reference genome to produce CRAM files. Variants are called for each CRAM to produce gVCFs for each sample, which are then combined and joint-genotyped to produce a VCF file. VQSR is performed to produce a final recalibrated VCF file.

Though not part of the CCDG pipeline itself, the CRAM output from this upstream pipeline is used to call variants (SNPs and small indels) on a per-sample basis, outputting genomic VCF (gVCF) files. Finally, the gVCF files for all samples are combined and joint-called to produce a single VCF file. Optionally, variant quality scores are then recalibrated (variant quality score recalibration, VQSR). See Figure 1 for an overview of the pipeline steps.

The tools

After extensive testing of concordance, cost, and speed, MSSNG chose to use Sentieon to process their WGS samples. Sentieon provides a licensed toolset that implements computationally-optimized versions of common variant-calling tools, providing results up to 10x faster than GATK’s best-practices pipeline while maintaining high concordance with GATK’s results (Freed et al., 2017). Sentieon publishes comprehensive documentation outlining how to run a CCDG-compliant upstream pipeline, as well as information on common downstream analysis steps such as per-sample SNP and indel calling using HaplotypeCaller, and joint genotyping and VQSR using their GVCFtyper and VarCal algorithms.

Challenges and optimizations

Upstream Pipeline: FASTQ -> CRAM, GVCF

In the upstream part of the pipeline, raw FASTQ files are processed to per-sample CRAMs and gVCFs. This segment ran smoothly using Sentieon, taking an average of 4 hours per sample (64-core virtual machine (VM), 55 GB of RAM). In rare cases (~30/9,625 total samples) the alignment step ran out of memory, and RAM was increased for these samples. Since many of the Sentieon algorithms are I/O-bound (that is, they are bottlenecked by the speed of reading and writing to the disk, rather than by CPU or memory usage), we also chose to use local SSDs for storage, which provide very fast I/O speeds.

We were able to run the upstream pipeline using preemptible VMs, a machine type that is provided at a much lower cost by Google but which may be shut down at any time if the resources are needed elsewhere. If a VM is shut down in this way, all progress on a task is lost and the task will be automatically restarted on another VM. If a VM is preempted frequently enough, the cost and time lost from running and rerunning the task can outweigh the savings of using a preemptible VM. Out of 8,377 successful runs that we inspected, we found that 7,163 runs were not preempted in any step. The average raw compute cost for these runs was $2.43 USD including storage, CPU, and RAM costs (not including the price of the Sentieon licence itself). We also observed that larger VMs (such as the 64 CPU/55 GB RAM VMs used for the Sentieon steps) showed far fewer preemption events than smaller ones. The upstream pipeline was run in parallel, with ~500 samples run concurrently.
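To put the preemption figures in proportion, the short calculation below turns the two run counts quoted above into percentages. It is only arithmetic on the numbers in the text; the helper name pct is made up for this sketch.

```shell
#!/usr/bin/env bash
# Share of inspected upstream runs that finished without any preemption.
# Both counts are taken from the text above.
total_runs=8377
clean_runs=7163

# pct PART WHOLE -> percentage with one decimal place (hypothetical helper)
pct() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f", 100 * part / whole }'
}

preempted_runs=$(( total_runs - clean_runs ))
echo "never preempted:         $(pct "$clean_runs" "$total_runs")%"
echo "preempted at least once: $(pct "$preempted_runs" "$total_runs")%"
```

In other words, roughly six out of every seven inspected runs completed on a preemptible VM without a single restart.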

Downstream Pipeline: Joint Genotyping

The majority of complications occurred during the joint genotyping step, which requires merging and joint genotyping all gVCF files generated using the upstream pipeline (1 per sample). Whereas the upstream pipeline can be run massively in parallel with each sample in a separate VM, joint genotyping requires the presence of all of the data in a single VM. This raises two issues: 1) disk size required, and 2) the runtime of the pipeline.

Disk Size Requirements

With each gzipped input gVCF file taking 15–25GB of space, the disk space required for analysis runs into the tens of terabytes for input files alone. The size of the merged output file, which is around the same as the sum of all the input files, must also be considered. While the Google disk size limit of 64 TB per VM should be enough to accommodate this, the size of the output file would make it unwieldy.

Pipeline Runtime

Despite the optimizations implemented by Sentieon, the speed of many processes is limited by the speed of the zip/unzip process (which by default runs on a single core) as both input and output files are gzipped to save space. This reality dramatically slows down the analysis, especially given the size of the files involved.

Solutions

Splitting Up Joint-Genotyping By Region

Sentieon provides a number of built-in solutions that help manage both the size of the final VCF file, as well as the speed of the analysis. First, joint genotyping may be split up to operate independently on different regions of the genome (much like many of GATK’s tools, which allow the analysis to be split up over intervals). This means that 1) the joint genotyping analysis may be run in parallel across intervals, and 2) we do not need to localize the full gVCF file for every sample in every shard — only the region corresponding to the interval we are joint calling in that shard (Figure 2a).

Figure 2: Solutions for joint genotyping large cohorts using Sentieon. Compare these steps to the progression from gVCFs -> Recalibrated VCF in Figure 1. a) Parallelization of joint-calling. gVCFs are broken up by region and joint genotyping is run in parallel on small regions to produce a series of partial VCFs. Partial VCFs covering a chromosome are then merged to produce a ‘main’ and a ‘samples’ file for each chromosome. b) Structure of the ‘main’ and ‘samples’ files produced by merging joint-genotyped partial VCFs. The ‘main’ file is small, containing only columns 1–9 of a normal VCF file. The ‘samples’ file contains all sample columns (column 10 through the last column of the file), 9621 total columns in our case. c) VQSR and extraction of the final VCF files. The ‘main’ files from each chromosome are merged and VQSR is performed. The recalibrated VCF is split by chromosome to generate ‘recalibrated main’ files, which are combined with the ‘samples’ files for each chromosome to produce a single full recalibrated VCF file for each chromosome.

In order to read in only the required regions of the gVCFs without localizing the full files, we took advantage of a feature of htslib which allows bcftools to read directly from Google Cloud Storage locations. bcftools accesses Google credentials using the environment variable GCS_OAUTH_TOKEN, which can be defined as follows (assuming the user has authenticated with Google Cloud):

export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)

To localize only the desired region of each gVCF file, the following command is used for each sample’s gVCF URL:

bcftools view -R ${region}.bed -Oz -o ${sample}_${region}.g.vcf.gz ${gvcf_url}

Each region.bed should specify a different region of the genome, e.g. chr1:1–50000000. The result of running this command for each gVCF URL is a smaller gVCF file that only includes calls for the region specified in the region.bed file. All of these partial gVCF files are then joint-genotyped together to output a partial VCF that has calls only for the specified region (Figure 2a). Since joint-genotyping is in this way split up into many smaller jobs that can be run in parallel for each region, the process is made considerably faster.
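As a rough illustration of how such region files could be produced, the sketch below prints fixed-size BED windows for one contig. The function name make_windows, the 120 Mb contig length, and the 50 Mb window size are assumptions chosen for the example, not values from the MSSNG pipeline. Note that bcftools reads a regions file with a .bed suffix as 0-based, half-open intervals, so the region chr1:1–50000000 corresponds to the tab-separated line chr1, 0, 50000000.

```shell
#!/usr/bin/env bash
# Print BED windows (0-based start, exclusive end) covering one contig.
# usage: make_windows CONTIG LENGTH WINDOW_SIZE   (hypothetical helper)
make_windows() {
  awk -v chrom="$1" -v len="$2" -v win="$3" 'BEGIN {
    OFS = "\t"
    for (start = 0; start < len; start += win) {
      end = start + win
      if (end > len) end = len   # clip the last window to the contig end
      print chrom, start, end
    }
  }'
}

# Example: a hypothetical 120 Mb contig split into 50 Mb windows.
make_windows chr1 120000000 50000000
```

Each printed line can be saved as its own region.bed file. In practice the contig names and lengths would come from the reference genome's index rather than being typed by hand.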

Merging Joint-Genotyped Files by Chromosome

Once the partial VCFs for each region are produced, they must be merged together to form a final, complete VCF file that includes all regions. After some trial and error it was decided that rather than merging all regions of the genome together to form one large final VCF file, genomic regions would be merged on a per-chromosome basis in order to output 26 final VCF files (22 autosomes, chrX, chrY, chrM, and contig regions) (Figure 2a). Although the VCFs for each chromosome are still quite large, they are individually much more manageable than a single VCF containing all regions.

Since both the partial VCFs produced by joint-calling and the merged output file are gzipped, the process of merging these partial VCFs into a single file takes several days to a week to process even a single chromosome — once again, the speed of analysis is bound by the speed of the gzipping process. Had the merge step been run for the full dataset to produce a single output file, we predicted that this step would have taken upwards of a month to complete. By merging only the partial VCFs that made up each chromosome, we were able to run the process in parallel, meaning that the merge step took a little over a week to complete.

Sentieon’s ‘split_by_sample’ Option For Large VCFs

It should be noted that Sentieon provides a different method of reducing the size of the final VCF file: the output VCF can be split by sample, rather than by chromosome. This would result in a single ‘main’ file that contains only the first nine columns of the VCF (CHROM, POS, ID, etc.), and then a number of ‘samples’ files that each contain the calls for n samples (n could be 100, 500, 1000, etc.; the smaller the number of samples per file, the smaller the size of each file). Both the ‘main’ file and each of the ‘samples’ files have the data for all chromosomes present, but since each only contains a subset of the samples, each file is quite a bit smaller (Figure 2b). Since none of these files are valid VCF files, an ‘extraction’ step must be performed in order to produce a valid VCF file by combining the first 9 columns from the ‘main’ file with the desired sample columns from the various ‘samples’ files. Although here we chose to split the final VCF by chromosome rather than by sample to reduce final file size (since we required final VCFs that included every sample), we still took advantage of Sentieon’s ‘split_by_sample’ option because of the implications for VQSR runtime.
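The idea behind the ‘main’/‘samples’ split and the later extraction step can be shown on a toy, uncompressed VCF using only cut and paste. This is a conceptual sketch: the file names, sample names (S1 to S3), and the plain tab-separated layout are assumptions for illustration, and it does not reproduce Sentieon’s actual file format or its extraction script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy one-record VCF with three hypothetical samples.
tmp=$(mktemp -d)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\n'
} > "$tmp/toy.vcf"

# Split: fixed columns 1-9 go to 'main', sample columns 10 onwards to 'samples'.
cut -f1-9 "$tmp/toy.vcf" > "$tmp/main.tsv"
cut -f10- "$tmp/toy.vcf" > "$tmp/samples.tsv"

# 'Extraction': pick S1 and S3 (fields 1 and 3 of the samples file) and
# glue them back onto the fixed columns, row by row.
cut -f1,3 "$tmp/samples.tsv" | paste "$tmp/main.tsv" - > "$tmp/subset.vcf"
cat "$tmp/subset.vcf"
```

Because the two files keep their rows in the same order, recombining them is a purely column-wise operation, which is why neither half needs to be a valid VCF on its own.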

Running VQSR On A Large Dataset

As mentioned above, one of our biggest problems in dealing with a cohort of this size is the bottleneck introduced by the speed of unzipping/zipping large input and output files. We were able to improve the speed of our analysis by processing smaller regions of the genome in parallel rather than trying to read the entire genome sequentially. Coming up to the VQSR step, however, we realized that this step, which recalibrates variant quality scores across the entire dataset, needs information from the entire genome: a single VCF including all chromosomes must be used as input, and a single VCF with all chromosomes would be output. Not only would we lose the benefit of having our VCF files split by chromosome (and therefore more manageable in size), but reading and writing these massive gzipped VCF files would again take potentially weeks.

This is where Sentieon’s ‘split_by_sample’ option came in handy. Although we wanted all samples together in the final VCF and so should not have needed this option, by using it to output all sample information in a single ‘samples’ file (containing only the sample IDs and call information, that is, columns 10 onwards in the VCF), we were also able to produce a ‘main’ file for each chromosome. The ‘samples’ file for each chromosome is quite large since it contains all calls for all samples; the ‘main’ file, however, is at most a few hundred MB and contains only columns 1–9 (Figure 2b). This file is all that is needed to perform VQSR, and since it is so much smaller, a process that might have taken weeks to complete could be performed in less than a day.

The ‘main’ files for all chromosomes were combined into a single VCF-like file that contained columns 1–9 for the entire genome; VQSR was performed on this file; finally, the full-genome ‘recalibrated main’ VCF file was split once more by chromosome, to output one ‘recalibrated main’ file per chromosome. These ‘recalibrated main’ files were combined with their respective ‘samples’ files for each chromosome using Sentieon’s extraction script in order to produce a single recalibrated VCF for each chromosome, each of which includes all samples (Figure 2c). Although this extraction process was still bound by the speed of the gzip/gunzip process, it could again be run in parallel across chromosomes to reduce the total runtime.
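The step of splitting the genome-wide ‘recalibrated main’ file back into per-chromosome files can be pictured with the toy example below. The text does not say which tool was used for this split, so the awk approach, the file names, and the uncompressed three-record input are all assumptions made for illustration; real files would be compressed and far larger.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for a genome-wide 'recalibrated main' file (columns 1-9 only).
work=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\n'
  printf 'chr1\t200\t.\tC\tT\t40\tPASS\t.\tGT\n'
  printf 'chr2\t150\t.\tG\tA\t60\tPASS\t.\tGT\n'
} > "$work/recal_main.vcf"

# Collect the header lines, then route each record to the file named after
# its CHROM column, writing the header first whenever a new file is started.
awk -F'\t' -v dir="$work" '
  /^#/ { header = header $0 "\n"; next }
  {
    out = dir "/" $1 ".recal_main.vcf"
    if (!(out in seen)) { printf "%s", header > out; seen[out] = 1 }
    print > out
  }
' "$work/recal_main.vcf"

# One output file per chromosome, each with the full header.
wc -l "$work"/chr*.recal_main.vcf
```

Each per-chromosome output keeps its records in their original order, so it still lines up row for row with the matching ‘samples’ file when the two are recombined.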

Joint-genotyping the MSSNG cohort was an intensive effort that involved a collaboration between DNAstack, Sentieon, and MSSNG researchers. Sentieon’s CCDG-compliant algorithms allowed for quick, reproducible results that will support straightforward expansion in the future. The speed of the gzip process, as well as the size of output files, required creative solutions to common problems in order to improve speed and workflow costs; we look forward to further improvements in optimizing these processes for large cohorts to help accelerate research.

References

Freed, D., Aldana, R., Weber, J.A. and Edwards, J.S. 2017. The Sentieon Genomics Tools — A fast and accurate solution to variant calling from next-generation sequence data. bioRxiv. doi: https://doi.org/10.1101/115717
Regier, A.A., Farjoun, Y., Larson, D.E., Krasheninina, O., Kang, H.M., Howrigan, D.P., Chen, B-J., Kher, M., Banks, E., Ames, D.C., English, A.C., Li, H., Xing, J., Zhang, Y., Matise, T., Abecasis, G.R., Salerno, W., Zody, M.C., Neale, B.M. & Hall, I.M. 2018. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 9: 4038. doi: https://doi.org/10.1038/s41467-018-06159-4
htslib: https://github.com/samtools/htslib
bcftools: https://github.com/samtools/bcftools

Photo Credits

Microscopic image of crystallized DNA from autism genes. Photo Credit: Bianca Guimarae

News

October 2, 2020

DNAstack Showcases Standards-Based System For Federated Discovery and Analysis in GA4GH 2020 Connection Demos

Today, DNAstack revealed a software system built on standards developed by the Global Alliance for Genomics and Health (GA4GH) to enable federated analysis of genomics and related health data.

The system, demonstrated at the 8th Plenary Meeting of GA4GH earlier this week, features the first integration of multiple emerging standards for data access, discovery, and cloud computing, which are foundational for connecting secure, scalable, and distributed data sharing networks.

Enabling federated data sharing and analysis is important for advancing genomic research, where institutional and regulatory policies can restrict the use of valuable datasets that would otherwise stay siloed from each other. GA4GH has brought together a global community of collaborators that have worked for years to design domain-specific standards that facilitate the responsible sharing of genomic and related health information.

“We’re excited to debut an integrated system of GA4GH standards to help break through long-standing barriers to federated analysis of genomic data,” said Max Barkley, Senior Software Developer, Technical Lead at DNAstack, and co-lead of GA4GH Federated Analysis Systems Project (FASP). “This marks a major milestone for DNAstack in our mission to accelerate genomics medicine through data sharing.”

The DNAstack system was demonstrated through a real-world analysis of controlled access data hosted on multiple cloud platforms, including from the Autism Speaks MSSNG Project. The system was presented as one of three GA4GH 2020 Connection Demos. The two other demonstrations showed reproducibility of a bioinformatics analysis run in multiple environments, and multi-directional interoperability by combining implementations from different organizations.

“The Connection Demos are an enormous success for the members of the GA4GH Work Streams, who have collectively dedicated thousands of hours over the last three years toward standards development,” said Ewan Birney, Deputy Director General of the European Molecular Biology Laboratory (EMBL), Director of EMBL’s European Bioinformatics Institute (EMBL-EBI), and Chair of GA4GH. “The demos show how this community’s work will enable interoperability across the genomics endeavour.”

The GA4GH 2020 Connection Demos highlighted how standards can be used vertically and horizontally to share data while complying with institutional, regional, national, and international regulations as well as across cloud and analytics environments. Data sharing across platforms and institutions will enable the research community to access and analyze the tens of millions of genome sequences that have been generated for research and healthcare purposes, which has the potential to rapidly accelerate our scientific understanding, particularly in rare and complex diseases.

About DNAstack

DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.

News

October 2, 2020

DNAstack Showcases Standards-Based System For Federated Discovery and Analysis in GA4GH 2020 Connection Demos

Today, DNAstack revealed a software system built on standards developed by the Global Alliance for Genomics and Health (GA4GH) to enable federated analysis of genomics and related health data.

The system, demonstrated at the 8th Plenary Meeting of GA4GH earlier this week, features the first integration of multiple emerging standards for data access, discovery, and cloud computing, which are foundational for connecting secure, scalable, and distributed data sharing networks.

News

August 11, 2020

COVID-19 Researchers Get a Boost From AI-Powered Genomics Cloud

DNAstack is helping scientists around the globe better understand COVID-19, so they can develop treatments and vaccines.

News

August 6, 2020

Harmonized Variant Calling for SARS-CoV-2 Genomes

To combat the current COVID-19 pandemic, scientists around the world are sequencing viral genomes at an accelerated pace.

These sequences are then deposited into a number of international databases, including the National Center for Biotechnology Information (NCBI; Figure 1). This approach has limitations: with data spread across multiple databases, it is challenging for a single researcher to consolidate data from different sources. The data were also generated and processed by different research groups at different institutions, which results in batch effects when they are amalgamated, that is, differences in signal across groups of viral genomes processed together that represent technical noise rather than biological variation. The distributed nature of the data and the lack of uniformity in data generation and processing hinder the pace of scientific discovery. To accelerate discovery, we need to leverage the breadth of data available internationally, which requires consolidating data from multiple sources and “cleaning” the data to reduce technical artifacts introduced during processing.

To address this critical gap and accelerate scientific discovery, DNAstack released COVID Cloud, a cloud-based solution that uniquely indexes and integrates data from multiple international sources into a unified data lake. Identifying mutations in the viral genome can help researchers design novel therapeutics and track viral transmissions. COVID Cloud provides easily accessible viral genome data ready for analysis. The data in COVID Cloud can be browsed through apps providing different perspectives including faceted search, point lookup, and 3D visualizations. Users can also export the data into downstream analytical workspaces, such as Jupyter Notebooks, Power BI, or DNAstack’s Workflow Execution Service.

Figure 1

SARS-CoV-2 variant detection

To facilitate high-quality, reproducible variant calling at scale, we have developed and published an open source workflow written in Workflow Description Language (WDL) (available on Dockstore and GitHub). The workflow has two sub-workflows: one to handle long-read (i.e. Nanopore) sequencing data and one to handle short-read, paired-end (i.e. Illumina) sequencing data (see the Nanopore variant calling and Illumina paired-end variant calling sections for more details). These workflows are designed to take NCBI run accessions as input (used to access raw FASTQ files) and return high-confidence variant calls (as VCF files) as well as a consensus genome sequence.

We deployed our WDL variant calling workflows to identify mutations in 10,838 amplicon-based viral sequences hosted on NCBI (10,664 unique samples). Of these, 4,427 are Illumina paired-end short-read sequences and 4,471 are Nanopore long-read sequences. The resulting variant calls, as well as links to per-sample VCFs and assemblies, have been made freely available for exploration and download at COVID Cloud.

To further promote scalable and reproducible science, we have published both the Illumina and Nanopore WDL workflows. Below is a brief tutorial on how to call variants locally using the DNAstack variant calling workflow.

Running the workflow

The COVID-19 variant calling workflow is available on DNAstack’s GitHub. The workflow may also be viewed on Dockstore. The tools and pipelines used in the workflow have been packaged into publicly available Docker images, allowing the workflow to be run reproducibly across compute environments. Instructions for running the pipeline locally in addition to test input files can be found in the workflow documentation. Briefly, to run the workflow locally, simply edit the input_template.json file found in the inputs directory of the GitHub repository to specify the parameters for the sample of interest, then run it with a workflow runner of your choice, e.g.:

Using Cromwell:

java -jar cromwell.jar run main.wdl -i input_template.json

Using miniwdl:

miniwdl run main.wdl -i input_template.json

The required parameters are the run accession from NCBI (explore SARS-CoV-2 run data); the library type (NANOPORE or ILLUMINA_PE) to determine which pipeline to run; and a file and version number indicating the primer scheme that was used to prepare the library.
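As a rough illustration of the parameters listed above, the following Python sketch assembles an inputs file for one sample. The JSON key names (`main.accession` and so on), the placeholder accession, and the primer scheme file name are hypothetical; the real keys should be copied from input_template.json in the repository.

```python
import json

# Library types named in the text above.
VALID_LIBRARY_TYPES = {"NANOPORE", "ILLUMINA_PE"}


def build_inputs(accession, library_type, primer_scheme_file, primer_scheme_version):
    """Return a dict of workflow inputs for one sample.

    All key names below are hypothetical placeholders, not the
    workflow's actual input names.
    """
    if library_type not in VALID_LIBRARY_TYPES:
        raise ValueError(f"library_type must be one of {sorted(VALID_LIBRARY_TYPES)}")
    return {
        # Hypothetical keys: replace with those in input_template.json.
        "main.accession": accession,
        "main.library_type": library_type,
        "main.primer_scheme_file": primer_scheme_file,
        "main.primer_scheme_version": primer_scheme_version,
    }


if __name__ == "__main__":
    # "SRR0000000" is a placeholder, not a real run accession.
    inputs = build_inputs("SRR0000000", "ILLUMINA_PE", "primer_scheme.bed", "V3")
    print(json.dumps(inputs, indent=2))
```

Validating the library type up front catches a mistyped value before a workflow runner is launched.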

Nanopore variant calling

Our Nanopore variant calling workflow leverages the ARTIC bioinformatics protocol (Loman et al., 2020), as implemented by the Connor lab. Briefly, reads are filtered and mapped to the SARS-CoV-2 reference genome (MN908947.3) using minimap2, following which amplicon primer sequences are trimmed. Next, medaka uses neural networks to create a consensus sequence (Oxford Nanopore Technologies, 2018). Medaka is also used to call variants, which are then fed into longshot to produce a set of high-confidence variants. Bcftools is used to generate the final consensus assembly sequence.

For more details please see the ARTIC bioinformatics protocol documentation and the GitHub for the nextflow implementation of this protocol we use in our workflow.
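To make the final consensus step more concrete, here is a toy Python sketch of the underlying idea: substituting called variants into the reference sequence. It is not how bcftools is implemented, it handles single-nucleotide variants only, and the function name and tuple layout are our own assumptions for illustration.

```python
def apply_snvs(reference, snvs):
    """Apply single-nucleotide variants to a reference sequence.

    snvs is a list of (position, ref_base, alt_base) tuples with
    1-based positions, as in VCF. Indels are deliberately not
    handled in this sketch.
    """
    bases = list(reference)
    for pos, ref_base, alt_base in snvs:
        # Guard against variants called against a different reference.
        if bases[pos - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt_base
    return "".join(bases)


if __name__ == "__main__":
    # Toy sequence, not a real genome fragment.
    print(apply_snvs("ACGTACGT", [(2, "C", "T"), (8, "T", "A")]))
```

The reference-base check mirrors a common sanity test: a variant whose stated reference allele disagrees with the sequence usually signals a coordinate or reference mismatch.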

Illumina paired-end variant calling

Our Illumina paired-end variant calling workflow leverages the SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) protocol, produced by the McArthur lab. Briefly, reads are mapped to the human genome (GRCh38) using BWA-MEM to remove host reads (Li and Durbin, 2009). Next, adapters are trimmed and reads are mapped to the reference genome using BWA-MEM. Primer sequences are removed and variants are called using ivar with default parameters (Grubaugh et al., 2018). Ivar is also used to produce a consensus assembly sequence.

For more details please see the SIGNAL GitHub and documentation.

We are immensely grateful to the labs and individuals responsible for creating and open-sourcing the pipelines we have chosen to run this COVID analysis.

Our goals were twofold: we wanted to produce robust, reliable data that could be freely distributed to researchers in the hope that such a large volume of data could help provide novel insight into the virus, and we wanted to provide an easy-to-use pipeline for others aiming to process their own data or to reproduce our analysis on the NCBI data.

We will continue to iterate and improve upon our analysis pipelines, ingesting more data every day as it becomes available. In the near future, we plan to add an Illumina single-end pipeline, as well as a method to process metagenomic samples. We also plan to identify and ingest from more databases that expose raw sequencing data and metadata; while assembled genomes are valuable, per-site confidence can only be established when raw reads are available. Since our workflows are written in WDL and their environments are containerized, they can be reproducibly run in virtually any compute environment, ensuring accurate results in an infrastructure-independent way.

We believe that making science open, from the raw data, to analysis methods, to sharing results, is essential not only to combat fast-moving diseases such as COVID, but also more generally to increase the momentum of research across domains. It is our goal to continue to package and share best-practices workflows, analytics, data and results, so that researchers around the world can more rapidly leverage resources that may otherwise not be available to them.

Resources

SIGNAL pipeline: https://github.com/jaleezyy/covid-19-signal
ARTIC pipeline: https://github.com/connor-lab/ncov2019-artic-nf
DNAstack COVID-19 workflow: https://github.com/DNAstack/covid-processing-pipeline
DNAstack workflow on Dockstore: https://dockstore.org/workflows/github.com/DNAstack/covid-processing-pipeline/covid-19-varcal:master

References and Further Reading

Grubaugh, N.D., Gangavarapu, K., Quick, J., Matteson, N.L., De Jesus, J.G., Main, B.J., Tan, A.L., Paul, L.M., Brackney, D.E., Grewal, S., Gurfield, N., Van Rompay, K.K.A., Isern, S., Michael, S.F., Coffey, L.L., Loman, N.J. and Andersen, K.G. 2018. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biology. 20(8). https://doi.org/10.1186/s13059-018-1618-7
Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 25(14): 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
Loman, N., Rowe, W. and Rambaut, A. nCoV-2019 novel coronavirus bioinformatics protocol. ARTIC Network. https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html. Published January 23, 2020. Accessed July 3, 2020.
Oxford Nanopore Technologies Ltd. 2018. Medaka. https://nanoporetech.github.io/medaka/index.html. Accessed July 3, 2020.

News

July 30, 2020

DNAstack Launches Genomic Data Explorer to Accelerate Research in COVID-19

DNAstack today announced COVID Cloud, an online destination for exploring one of the largest collections of viral genome sequences in the world.

As the global caseload of COVID-19 exceeds 16.5 million people in over 200 countries, scientists are racing to study the genetics of the virus that causes it, SARS-CoV-2, to inform the development of urgently needed diagnostics and treatments. COVID Cloud is a software solution created by DNAstack that connects and shares a large and growing number of viral genomes seen around the world combined with visualization and analytical tools for scientists to examine the molecular machinery of the virus as it continues to spread and evolve.

“By sharing genetic data globally, we can mount a sort of digital immune response to help us defend against this and future outbreaks,” said Marc Fiume, CEO of DNAstack. “With COVID Cloud, we can help scientists take the best technologies in genomics, data sharing, cloud computing, and machine learning to the fight against COVID-19.”

COVID Cloud provides unified access to a globally representative repository of viral genomes, which is updated daily with new sequences from international biobanks. In order to reduce errors that arise when comparing datasets from multiple sources, DNAstack processes raw data using harmonized bioinformatics pipelines. These pipelines have been authored in the platform-agnostic Workflow Description Language and published as open source on Dockstore and GitHub, to promote reproducible science and community collaboration.

Figure: Using Variants, researchers can search the entire catalog of mutations found in SARS-CoV-2 sequences.

Datasets are shared over an integrated set of APIs defined by the Global Alliance for Genomics & Health, providing a standards-compliant platform on which the community can build powerful integrations and applications. For example, all of the files in COVID Cloud are served over the GA4GH Data Repository Service, a vendor-neutral way of representing files, to streamline their use in downstream analytical environments such as Jupyter Notebooks, Microsoft Power BI, and DNAstack’s Workflow Execution Service.
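As a rough illustration of what serving files over the Data Repository Service (DRS) means for a client, the Python sketch below converts a hostname-based DRS URI into the HTTPS URL of the corresponding object record, following the `drs://<hostname>/<id>` to `https://<hostname>/ga4gh/drs/v1/objects/<id>` mapping described in the GA4GH DRS specification. The function name, host name, and object ID are invented for illustration and do not refer to COVID Cloud's actual endpoints.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to the HTTPS URL of its object record."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError("DRS URI has no object id")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


if __name__ == "__main__":
    # Hypothetical host and object ID.
    print(drs_uri_to_url("drs://drs.example.org/abc123"))
```

A downstream tool would then issue an HTTP GET against the resulting URL to retrieve the object's metadata and access methods.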

COVID Cloud also gives scientists intuitive controls for interactive exploration of the data. Using the Sequences tool, users can search over the entire catalogue of genomics data and information about the original source, collection date, and geographic location. Beacon lets scientists look up the prevalence of specific genetic mutations, such as D614G, a variant that appears to make SARS-CoV-2 more transmissible. With Molecules, researchers can manipulate three-dimensional representations of proteins encoded by the viral genome, like the Spike protein, in order to understand their physical conformations and predict how genetic mutations and therapeutic interventions may impact their function.
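To show how a protein-level mutation such as D614G relates to genome coordinates, the short Python sketch below computes the nucleotide positions of a codon from its amino acid position. It assumes the spike coding sequence begins at position 21563 (1-based) in the MN908947.3 reference, a value taken from the public annotation that should be verified before reuse; the function name is our own.

```python
# Assumed 1-based start of the spike (S) coding sequence in MN908947.3.
S_GENE_START = 21563


def codon_range(gene_start, aa_position):
    """Return the 1-based (first, last) genome positions of a codon.

    aa_position is the 1-based amino acid position within the protein.
    """
    first = gene_start + 3 * (aa_position - 1)
    return first, first + 2


if __name__ == "__main__":
    first, last = codon_range(S_GENE_START, 614)
    print(f"Spike codon 614 spans genome positions {first}-{last}")
```

Under this assumption codon 614 spans positions 23402 to 23404, which is consistent with D614G commonly being reported at the nucleotide level as A23403G.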

COVID Cloud is hosted by DNAstack as a free service deployed on Microsoft Azure. The software that powers COVID Cloud is available to license for sharing public or private collections of genomics and clinical data related to COVID-19 or other disease areas.

The development of COVID Cloud has been supported through feasibility funding of the Digital Technology Supercluster’s COVID-19 Program, which aims to improve the health and safety of Canadians and support Canada’s ability to address issues created by the COVID-19 outbreak.

About DNAstack

DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.

About Digital Technology Supercluster

The Digital Technology Supercluster solves some of industry’s and society’s biggest problems through Canadian-made technologies. We bring together private and public sector organizations of all sizes to address challenges facing Canada’s economic sectors including healthcare, natural resources, manufacturing, and transportation. Through this ‘collaborative innovation,’ the Supercluster helps to drive solutions better than any single organization could on its own. The Digital Technology Supercluster is led by industry leaders such as D-Wave, LifeLabs, LlamaZOO, Lululemon, MDA, Microsoft, Mosaic Forest Management, Sanctuary AI, Teck Resources Limited, TELUS, Terramera, and 1Qbit. Together, we work to position Canada as a global hub for digital innovation. A full list of Members can be found here.

Media Inquiries

For DNAstack media inquiries: Christine Beyaert, christine@dnastack.com

For Digital Technology Supercluster related media inquiries: Elysa Darling, elysa@switchboardpr.com

News

July 13, 2020

SARS-CoV-2: Biology, Origins, and How Open Science is Accelerating the Search for Therapeutic Answers

DNAstack's bioinformatician Heather Ward breaks down the biology of the novel coronavirus responsible for the COVID-19 outbreak.

Introduction

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is the novel coronavirus responsible for the COVID-19 outbreak that first emerged in early December 2019 in Wuhan, China. As of March 20, 2020 SARS-CoV-2 has resulted in nearly 250,000 cases worldwide, claiming the lives of over 10,000 people.

Here, I’ll briefly break down the potential origins and viral life cycle of SARS-CoV-2, how it differs from the virus responsible for the 2002 outbreak, and how genomics and open science can be used to explore and develop therapeutics that will help mitigate this global threat.

SARS-CoV-2 and related coronaviruses

SARS-CoV-2 is a coronavirus, a member of a group of positive-sense single-stranded RNA (ssRNA) viruses so named for their resemblance to the solar corona. Other ssRNA viruses cause diseases that range widely in severity and include HIV, West Nile virus, and the viruses that cause the common cold.

There are several coronaviruses known to infect humans, the most well known being SARS-CoV (responsible for the 2002 outbreak) and MERS-CoV (Middle East Respiratory Syndrome Coronavirus). Both of these coronaviruses, as well as the current SARS-CoV-2, are believed to have originated in bats, which act as a natural reservoir for a number of coronaviruses. These viruses are postulated to pass to humans via an intermediary host (civet cats in the case of SARS-CoV, and dromedary camels for MERS-CoV). Several potential hosts have been suggested as the intermediary for the current SARS-CoV-2, including snakes and pangolins.

It’s important to note that the majority of these bat-endemic coronaviruses are not able to infect humans, and mutation is required for a coronavirus to transition to a new host organism. Understanding which parts of the genome must mutate for a virus to target a new host first requires an understanding of the basics of the coronavirus life cycle.

Figure 1: SARS-CoV-2 virion.

The SARS-CoV-2 viral life cycle

The major steps of the viral life cycle of SARS-CoV-2 as well as other coronaviruses include:

Binding of the virus to a receptor on a target host cell
Membrane fusion between the viral envelope and the host cell, which releases the viral genome into the host cell
Replication of the viral genome
Transcription and translation of viral structural proteins
Assembly and export of mature virions

Mature virions (packaged viral particles including the viral genome and structural proteins, see the SARS-CoV-2 virion pictured in figure 1) released from an infected host cell may infect other cells and continue the infection cycle.

If virions are unable to bind to host cell receptors or if membrane fusion does not occur, infection will not take place. These key steps are both mediated by a particular viral protein — the spike protein.

The Spike protein

The spike protein is a homotrimeric (made up of three identical peptides) transmembrane protein found studded around the exterior of the mature virion. Each monomer (one of the three identical peptides) is comprised of two subunits: the S1 subunit, which is responsible for recognizing and binding to a host cell receptor, and the S2 subunit, which facilitates membrane fusion and release of the viral genome into the host cell (see figure 2).

Because the virus can only infect host cells that it is able to bind to, the S1 subunit of the spike protein is responsible for host specificity — the range of hosts that the virus is able to infect. In order for a virus to be able to infect a new organism — e.g. in the transition between bat and human hosts — the receptor binding domain of the S1 subunit must gain the ability to bind to a receptor found in that new host. In both SARS-CoV and SARS-CoV-2, the human receptor appears to be the protein angiotensin converting enzyme 2 (ACE2), which is found on the surface of cells in the human respiratory tract. Interestingly, despite targeting the same receptor protein, many of the key amino acids that interact with the ACE2 receptor and that were previously thought to be essential for binding to ACE2 appear to be almost completely distinct between the SARS-CoV and SARS-CoV-2 receptor binding domains, implying that specificity for the same receptor may have evolved independently in each strain.

Figure 2: Structure of the SARS-CoV spike protein monomer (blue and green) bound to the ACE2 receptor (yellow). The spike protein is comprised of the S1 (blue) and S2 (green) subunits. S1/S2 and S2' cleavage sites are labelled in red. Generated using open-source PyMOL™ from the cryo-EM structure.

Activation of the spike protein following receptor binding

Receptor binding alone is not sufficient for viral infection. Binding initiates conformational changes in the spike protein that lead to membrane fusion and infection, but another step is required before fusion can take place: cleavage of the spike protein.

There are at least two cleavage sites on the spike protein that must be cut prior to viral entry; one between the S1 and S2 subunits (S1/S2 site) and one internal to the S2 subunit (S2' site) (see figure 2, red). Cleavage at the S1/S2 site primes the protein and leads to cleavage of the S2' site, which is necessary for membrane fusion. The specific proteases (proteins that cut other proteins) that are able to perform the cleavage steps depend on the amino acid sequence that is present at each cleavage site; in many cases, several different proteases are able to cut the same site with greater or lesser efficiency.

Similar to the host-specificity of the receptor binding domain, if cleavage sites are not recognized by host proteases, cleavage and therefore infection will not be able to occur in that host. This means that both a receptor binding domain that recognizes a host target as well as cleavage sites that can be cut by host proteases are required for transmission of the virus to a novel host. For example, some bat coronaviruses have been found that are able to bind to human proteins but fail to initiate infection because their spike protein is not cleaved in human hosts.

A novel cleavage site on SARS-CoV-2

In SARS-CoV-2, a novel cleavage site has been discovered at the S1/S2 junction which is cleaved by a ubiquitous human protease known as furin. The inclusion of this novel furin site allows the SARS-CoV-2 spike protein to be cleaved during biosynthesis — this means that the protein is ‘primed’ even prior to release of the virion from the host cell. This is in contrast to the spike protein produced by SARS-CoV, which lacks this site and is released from the cell intact, requiring later cleavage before it can facilitate membrane fusion.

It is unclear whether priming during biosynthesis has an impact on viral infectivity; a 2006 study by Follis et al. found that the introduction of a furin cleavage site into SARS-CoV’s spike protein at the S1/S2 junction resulted in enhanced membrane fusion between virus and host, but could find no evidence for an accompanying increase in infectivity. It remains to be seen how the novel furin site in SARS-CoV-2 will impact its infectivity and spread.
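For readers who want to look for furin-like sites in sequence data, the Python sketch below scans a peptide for the minimal furin recognition motif often quoted in the literature, R-X-X-R (arginine, any two residues, arginine). This is a simplification: real protease specificity depends on more than a four-residue pattern. The two peptide fragments are illustrative stand-ins for the regions around the S1/S2 junction and should be checked against the reference protein records before being relied upon.

```python
import re

# Minimal furin-like motif R-X-X-R; the lookahead allows overlapping matches.
FURIN_MINIMAL_MOTIF = re.compile(r"(?=(R..R))")


def find_furin_like_sites(peptide):
    """Return 0-based start offsets of every R-X-X-R match in the peptide."""
    return [m.start() for m in FURIN_MINIMAL_MOTIF.finditer(peptide)]


if __name__ == "__main__":
    # Illustrative fragments near the S1/S2 junction (assumptions, see text).
    fragments = {
        "SARS-CoV-2": "NSPRRARSVAS",
        "SARS-CoV": "SLLRSTSQ",
    }
    for name, peptide in fragments.items():
        print(name, find_furin_like_sites(peptide))
```

With these fragments the motif is found only in the SARS-CoV-2 fragment, in line with the observation above that SARS-CoV lacks the site.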

A key target for therapeutic agents

Researchers across the globe are searching the SARS-CoV-2 genome for features that will allow it to be targeted by therapeutic agents. Due to the nature of the spike protein and its fundamental role in mediating host specificity and viral infection, it represents an attractive target for the development of therapeutic agents. In particular, mechanisms targeting receptor binding, proteolytic cleavage, and membrane fusion may prove effective in attenuating the virus’s ability to infect human cells. Due to the genetic similarity between the novel SARS-CoV-2 and SARS-CoV, including their shared receptor target, it is possible that agents shown to be effective against SARS-CoV may also prove effective at slowing SARS-CoV-2.

SARS-CoV-2 Research

The swift response of researchers worldwide to study SARS-CoV-2 and to share sequencing data publicly has allowed for rapid insights into key genetic features that will prove indispensable in the days and months to come. This tremendous, coordinated global effort to elucidate the origins and mechanisms of the virus could not have been accomplished without the aid of modern technologies allowing researchers to share data quickly across geopolitical borders. This reaffirms the essential role of technology in facilitating science, especially in the ability to respond quickly to global emergencies.

To that end, DNAstack has developed a beacon for SARS-CoV-2 where users can explore aggregated genetic variants discovered by labs worldwide. Explore it here: covid-19.dnastack.com.

About the Author

Heather is part of the Data Science Team at DNAstack, where she authors, tests, and runs analytical pipelines for internal and customer projects.

References and Further Reading

Belouzard, S., Chu, V.C. and Whittaker, G.R. 2009. Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites. PNAS. 106(14): 5871–5876.
Chan, J.F.W., Kok, K-H., Zhu, Z., Chu, H., To, K. K-W., Yuan, S. and Yuen, K-Y. 2020. Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan. Emerging Microbes & Infections. 9: 221–246.
Coutard, B., Valle, C., de Lamballerie, X., Canard, B., Seidah, N.G. and Decroly, E. 2020. The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade. Antiviral Research. 176: 104742.
Follis, K.E., York, J. and Nunberg, J.H. 2006. Furin cleavage of the SARS coronavirus spike glycoprotein enhances cell-cell fusion but does not affect virion entry. Virology. 350: 358–369.
Gong, S. and Bao, L-L. 2018. The battle against SARS and MERS coronaviruses: Reservoirs and animal models. Animal Model Exp Med. 1: 125–133.
Millet, J.K. and Whittaker, G.R. 2015. Host cell proteases: critical determinants of coronavirus tropism and pathogenesis. Virus Research. 202: 120–134.
Racaniello, V. Furin cleavage site in the SARS-CoV-2 coronavirus glycoprotein. Virology blog. http://www.virology.ws/2020/02/13/furin-cleavage-site-in-the-sars-cov-2-coronavirus-glycoprotein/. Published February 13, 2020. Accessed March 10, 2020.
Song, W., Gui, M., Wang, X. and Xiang, Y. 2018. Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2. PLOS Pathogens. https://doi.org/10.1371/journal.ppat.1007236
Walls, A.C., Park, Y-J., Tortorici, M.J., Wall, A., McGuire, A.T. and Veesler, D. 2020. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell. 180: 1–12.
Wong, M.C., Cregeen, S.J., Ajami, N.J. and Petrosino, J.F. 2020 (preprint). Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019. bioRxiv, preprint. https://doi.org/10.1101/2020.02.07.939207
Xia, S., Zhu, Y., Liu, M., Lan, Q., Xu, W., Wu, Y., Ying, T., Liu, S., Shi, Z., Jiang, S. and Lu, L. 2020. Fusion mechanism of 2019-nCoV and fusion inhibitors targeting HR1 domain in spike protein. Cellular & Molecular Immunology. https://doi.org/10.1038/s41423-020-0374-2
Xu, X., Chen, P., Wang, J., Feng, J., Zhou, H., Li, X., Zhong, W. and Hao, P. 2020. Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission. Science China Life Sciences. 63(3): 457–460.

About DNAstack

DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.

Photo Credits

Figure 1: CDC/Alissa Eckert, MS; Dan Higgins, MAMS
Figure 2: Song et al., 2018; PDB accession 6ACK.

News

July 13, 2020

SARS-CoV-2: Biology, Origins, and How Open Science is Accelerating the Search for Therapeutic Answers

DNAstack's bioinformatician Heather Ward breaks down the biology of the novel coronavirus responsible for the COVID-19 outbreak.

Podcast

May 11, 2020

Omics Xchange Podcast - COVID-19 Beacon: an Interview with Marc Fiume

Press Releases

May 4, 2020

Software Tool Built by U of T Startup Shares Genetic Data with COVID-19 Researchers Around the World

Co-founded by U of T alumnus Marc Fiume, DNAstack launched a search engine aimed at the global research community that scans and indexes genomic information about the novel coronavirus.

Video

April 21, 2020

Prime Minister Justin Trudeau Names DNAstack in His Address Highlighting Role as a Key Canadian Innovator in the Fight Against COVID-19

Press Releases

April 20, 2020

Using Innovation to Protect Canadians and Secure Our Economy

Canada's Digital Technology Supercluster is supporting Canadians in the fight against COVID-19.

Press Releases

March 20, 2020

DNAstack Launches COVID-19 Beacon to Accelerate Sharing Genomic Data in the Fight Against Novel Coronavirus

DNAstack today introduced a Beacon for SARS-CoV-2, the virus that causes COVID-19, available at covid-19.dnastack.com.

Press Releases

November 11, 2019

Supercluster Helps DNAstack Keep up with its Ambitions

Toronto company is ‘getting noticed’ as it works to build the digital infrastructure to power the next generation of scientific research.

News

October 14, 2019

DNAstack Launches Clinical Evidence Beacons to Drive Crowdsourcing for Genetic Disease Discovery

The Beacon Network, where new Clinical Evidence Beacons can be searched for crowdsourcing classification of genomic variants.

DNAstack today announced the launch of Clinical Evidence Beacons on the Beacon Network, a real-time search engine for finding genetic mutations across a global network of genomic datasets. These additions will enable medical laboratories to crowdsource the interpretation of variants through a secure social network.

Accurately interpreting DNA variants identified through genetic testing is essential for patients and clinicians to make informed medical decisions across a growing number of medical use cases. While some of those variants can be confidently predicted to be pathogenic or benign based on previous studies and data accessible through variant interpretation resources, in many cases evidence is missing or inconsistent, resulting in conflicting evaluations or reporting as “variants of unknown significance” (VUS). Clinical Evidence Beacons facilitate faster and more consistent variant classifications by securely sharing variant interpretation evidence between collaborating organizations, accelerating the exchange of critical knowledge and improving support for patients affected by genetic diseases and carriers of variants that have an impact on medical decision making.

Building upon the Global Alliance for Genomics and Health (GA4GH) Beacon API, an open standard that allows researchers to determine whether a given variant exists within a genomic dataset, Clinical Evidence Beacons are an extension of the protocol being piloted by DNAstack, allowing uncurated knowledge about a variant to be shared and discovered in real time.

“The Beacon API 1.0, which was approved as a GA4GH standard last year, validates the international community’s willingness to work together to define standards and engage in data sharing in a meaningful way,” said Miro Cupak, VP Engineering at DNAstack. “The original protocol was intentionally simple. We’ve since been exploring more powerful derivatives, and learned that by integrating clinical data in the payload of the Beacon we can help solve outstanding issues faced by the clinical genomics community. Clinical Evidence Beacons, while not formally approved as a GA4GH standard, demonstrate one potential application being designed for future versions of the protocol.”
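The yes/no lookup at the heart of the Beacon idea can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not the GA4GH Beacon API or DNAstack's implementation: the `VariantQuery` and `ToyBeacon` names, the field names, the lab names, and the variants are all hypothetical, and a real Beacon is queried over a network rather than in-process.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class VariantQuery:
    """A single-variant question; field names are illustrative assumptions."""
    chromosome: str
    position: int
    reference_bases: str
    alternate_bases: str


class ToyBeacon:
    """Holds a private variant set and only ever answers yes or no."""

    def __init__(self, name: str, variants: Iterable[VariantQuery]) -> None:
        self.name = name
        self._variants = frozenset(variants)

    def exists(self, query: VariantQuery) -> bool:
        # The underlying records are never returned, only their presence.
        return query in self._variants


def query_network(beacons: List[ToyBeacon], query: VariantQuery) -> Dict[str, bool]:
    """Ask every beacon the same question and collect the answers by name."""
    return {beacon.name: beacon.exists(query) for beacon in beacons}


# Hypothetical variants and lab names, for illustration only.
variant = VariantQuery("1", 12345, "A", "G")
network = [
    ToyBeacon("lab-a", [variant]),
    ToyBeacon("lab-b", [VariantQuery("2", 67890, "C", "T")]),
]
print(query_network(network, variant))
```

Keeping the answer to a bare boolean is what made the original protocol "intentionally simple"; an extension along the lines of Clinical Evidence Beacons would attach additional evidence to a positive answer.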

Miro Cupak, VP Engineering at DNAstack

The first Clinical Evidence Beacons to join the Beacon Network come from the Canadian Open Genetics Repository (COGR), a network of over 20 laboratories that have come together to share information about variants and clinical cases. Members of the COGR are now able to search controlled access Clinical Evidence Beacons for variants of interest on the Beacon Network. The COGR has led a national effort to improve the quality of variant classification and previously reported that the percentage of variants with discordant classifications dropped from 26.7% to 14.2% as a result of crowdsourcing, demonstrating the power of collaboration between clinical genomics labs.

“Our understanding of genetic data continues to evolve and there is often not a one-to-one correlation between genetic variation and disease, so international or global data sharing efforts are vital to moving the field forward,” said Dr. Jordan Lerner-Ellis, principal investigator of the COGR and Head of Advanced Molecular Diagnostics at Toronto’s Mount Sinai Hospital, Sinai Health System and Associate Professor at the University of Toronto. “Systems that allow for easily accessible real-time data sharing will be increasingly important to be able to provide the most up-to-date information and to translate it into patient care.”

While the cost of genome sequencing has decreased significantly, it is still costly. Software that enables valuable biomedical data to be shared will enable future healthcare systems to draw on distributed collections of data in real time, unconstrained by traditional institutional silos and long publication cycles. As we move toward personalized healthcare, there is a need for such systems to integrate genomic information, clinical data, and real-world evidence to better inform treatment decisions.

“There is an enormous need to share genomic information and we have seen worldwide interest in the application of Beacons in healthcare environments,” said Jordi Rambla, European Genome-phenome Archive (EGA) Team Lead at the Centre for Genomic Regulation (CRG). “Working with the clinical community, we are pioneering ideas to improve upon Beacon 1.0, and using this knowledge and experience to inform the next version of this standard. Ultimately, Clinical Evidence Beacons could make sharing genomic information, as well as phenotypic data, easier, faster, and more secure than is possible today, accelerating knowledge exchange, diagnoses, and improvements to patient care.”

Jordi Rambla, European Genome-phenome Archive Team Lead at the Centre for Genomic Regulation

While DNAstack’s support for clinical use cases has been developed as extensions to the current Beacon version, an international effort coordinated by the GA4GH Discovery Work Stream and led by the ELIXIR Beacon Project is currently preparing a major upgrade of the GA4GH Beacon protocol. Since the roadmap of the next version includes changes supporting a variety of stakeholder-defined biomedical use cases, incorporation of the upcoming Beacon protocol into software supported by DNAstack will accelerate future applications for genomic variant research and discovery.

About DNAstack

DNAstack’s mission is to improve the lives of millions of people affected by genetic disease by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data.

Photo Credits

Cape Canaveral Air Force Station, United States. Photo Credit: SpaceX

Press Releases

March 4, 2019

Federated Discovery and Sharing of Genomic Data Using Beacons

Here we describe the Beacon protocol and how it can be used as a model for the federated discovery and sharing of genomic data.

Press Releases

November 27, 2018

Canada's Digital Technology Supercluster Officially Launches with $153M in Funding from ISED

The Government of Canada is investing up to $950 million over five years in five industry-led innovation Superclusters to accelerate economic growth, productivity, and competitiveness across the country.

Press Releases

October 24, 2018

Genomic Data Interoperability, Remote Workflow Key to New Global Alliance APIs

Newly released APIs are the first products from the Global Alliance for Genomics and Health's strategic roadmap for interoperability of genomic data.

Press Releases

October 11, 2018

ClinGen Advancing Genomic Data‐Sharing Standards as a GA4GH Driver Project

ClinGen has joined with the Global Alliance for Genomics and Health (GA4GH) to support the development of open, freely‐available technical standards and regulatory frameworks for secure and responsible sharing of genomic and health‐related data.

Video

October 4, 2018

Beacon: The Story so Far

Press Releases

August 9, 2018

Canadian Precision Health Infrastructure Emphasizes Secure Data Sharing, Privacy, Consent

Project partners will expand on infrastructure developed by DNAstack for accessing genomic data and explore patient consent models that support nationwide sharing.

Press Releases

August 2, 2018

Registered Access: Authorizing Data Access

The Global Alliance for Genomics and Health (GA4GH) proposes a data access policy model—“registered access”—to increase and improve access to data requiring an agreement to basic terms and conditions, such as the use of DNA sequence and health data in research.

News

August 1, 2018

DNAstack to Co-Develop a National Platform for Precision Health Through Canada’s Digital Technology Supercluster

DNAstack today announced its participation in a new project to accelerate the development of a national software platform for precision health in Canada.

The project — in which Deloitte, Genome BC, LifeLabs, Microsoft, Molecular You, Provincial Health Services Authority, and the University of British Columbia will also participate — is among the first to be selected and launched as part of Canada’s Digital Technology Supercluster, a federally funded program that recently received over $150M to stimulate the creation of competitive and innovative digital technology solutions for top industries.

With support from the Canadian government, the team is building a powerful new software platform that will make it easier for healthcare organizations, academic researchers, clinical laboratories, pharmaceutical companies, and other innovators to harness exponentially growing volumes of genomic and biomedical data. The platform will help drive new scientific discoveries and inform medical decisions, translating into more personalized and cost-effective healthcare for millions of Canadians.

The DNAstack team

The platform has been designed from the ground up around modern principles of data security, sharing, and analysis, and serves as an alternative path for organizations looking to avoid enormous, ongoing cost burdens associated with purchasing and maintaining local computational infrastructure. The platform is already being piloted with early adopters across the country, where it has proven to be dramatically more powerful, secure, cost-efficient, and accessible compared to other existing solutions. The project team aims to deliver the most advanced platform for precision health in the country, positioning healthcare organizations to roll out new programs that reap significant health and economic benefits for years to come.

“We’re laying the foundation for the future of genomic and biomedical science, where the combination of networked data and powerful technology is used to generate life-saving insights faster than ever before,” said Dr. Marc Fiume, CEO and Co-Founder at DNAstack. “With this platform, we’re empowering scientists to take big data, cloud computing, and machine learning to the fight against the biggest challenges in health.”

Dr. Marc Fiume, CEO and Co-Founder of DNAstack

The platform will provide easy-to-use tools for data producers (e.g. principal investigators, diagnostics laboratories, hospital systems, patient advocacy groups, individuals) to connect and administer the secure sharing of their datasets, and for data consumers (e.g. academic, clinical, pharmaceutical, and industry researchers) to discover and analyze that data using both gold-standard and custom applications. Individual users of the platform will be able to perform intensive statistical and machine learning analyses with on-demand access to hundreds of thousands of compute cores, more than 10 times the computing power of some of the best-equipped research institutions in Canada.

For DNAstack, the project is a continuation of years of global leadership and product innovation in the space. Since 2014, DNAstack has been an active member of the Global Alliance for Genomics & Health (GA4GH), where it contributes to the development of open standards for interoperable data sharing and analysis. This project is integrating key GA4GH protocols for identity, access, discovery, and analysis. In 2018, DNAstack co-founded the Canadian Genomics Cloud, the most computationally powerful public cloud platform for genomics and precision medicine in Canada, which is actively being used by leading scientists across the country to study the genetic causes of autism, adult cancer, pediatric cancer, heart disease, mental health, cystic fibrosis, and other rare diseases. DNAstack is now working in close collaboration with partners of the Digital Technology Supercluster, who bring diverse and complementary expertise, to introduce entirely new features to the market.

Bill Tam, Vice President of Business Development and Partner Relations, Canada’s Digital Technology Supercluster

“We are supporting ambitious opportunities that can’t be tackled by one company alone. Through a collective effort, this project aims to make a global impact and position Canada as a world leader in health,” said Bill Tam, Vice President of Business Development and Partner Relations for Canada’s Digital Technology Supercluster. “We are proud that the Supercluster has created an elevated platform for leading Canadian SMEs like DNAstack to continue to innovate and grow.”

Press Releases

June 25, 2018

DNAstack and Autism Speaks® Announce Collaboration to Accelerate Scientific Discovery on One of the World's Largest Autism Genome Databases

The Autism Speaks MSSNG project will help researchers answer the many remaining questions about the genetic underpinnings of autism by sequencing the DNA of over 10,000 families affected by autism.

Press Releases

March 14, 2018

Simplifying Research Access to Genomics and Health Data with Library Cards

The volume of genomics and health data is growing rapidly, driven by sequencing for both research and clinical use.

Press Releases

February 26, 2018

GA4GH Releases 2018 Strategic Roadmap

The Global Alliance for Genomics and Health (GA4GH) has announced their Strategic Roadmap, which includes a series of more than two dozen deliverables to be launched in 2018 and developed over the next one to three years.

Press Releases

February 18, 2018

Canadian Genomics Cloud to Develop GA4GH Compliant Precision Medicine Platform

DNAstack, Canada’s Genomics Enterprise, Google, the Centre of Genomics and Policy, and more announce the launch of the Canadian Genomics Cloud (CGC): a national cloud-based infrastructure for genomics initiatives to share data across Canada.

Press Releases

February 9, 2018

Health and the Genome Puzzle: Mapping DNA Has Gotten Cheaper, But Do We Know How to Use the Data?

Not only was Michael Szego the ethics lead on the Personal Genome Project Canada — he was also a participant, agreeing to have his genome mapped and shared publicly.

Press Releases

February 5, 2018

The Personal Genome Project Canada: Findings from Whole Genome Sequences of the Inaugural 56 Participants

The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information.

News

November 14, 2017

DNAstack Inks Partnership with Sentieon to Offer Faster, Cheaper, More Consistent Bioinformatics in the Cloud

DNAstack, a cloud genomics company, today is announcing a partnership with Sentieon, an award-winning bioinformatics software company.

Through this partnership, a suite of Sentieon’s algorithms will be made available through DNAstack’s Workflows app to deliver genomics data analysis pipelines in the cloud that are faster and cheaper, require no downsampling, and are 100% consistent, while using the same mathematics as the industry-standard best-practice workflows.

Sentieon technologies won Top Overall Performance and Highest Reproducibility in precisionFDA’s Consistency Challenge, and highest SNP Recall and INDEL Precision in its Truth Challenge. In the most recent precisionFDA Hidden Treasures-Warm Up challenge, along with 36 other submissions, all of Sentieon’s submissions caught all injected variants while using default parameters without any special filtering. Last year, Sentieon’s TNscope toolset also ranked #1 in all three categories (SNV, indel, and SV) of the ICGC-TCGA DREAM Challenge for Somatic Mutation Calling.

“At DNAstack, our mission is to democratize access to genomics data and best-in-class technologies to analyze it at scale,” said Dr. Marc Fiume, CEO of DNAstack. “The addition of Sentieon’s algorithms to our Workflows marketplace lets anyone with an internet connection run their award-winning software simply and at any scale. From there, they can interpret results privately in the context of the large and growing global network on our platform.” Fiume also co-leads the Discovery Workstream of the Global Alliance for Genomics & Health, which develops industry standards for sharing of genomics data, tools, and services.

Sentieon’s DNAseq pipeline for germline FASTQ-to-VCF analysis, running on DNAstack, takes around 5 hours and costs less than $15 for a whole genome sequence at 30X coverage.

“Through the integration with DNAstack, Sentieon technologies will be made more accessible to a global community of scientists to help accelerate breakthrough discoveries and the implementation of precision medicine,” said Dr. Jun Ye, Sentieon’s CEO. “We look forward to leveraging the power and efficiencies of DNAstack’s platform to deliver Sentieon’s accurate and fast tools for read alignment and variant calling to DNAstack’s customers.”

“This can serve a very large user base that needs a simple and cost-effective ‘sequencer-to-scientist’ solution,” said Fiume. “Especially as genomics becomes increasingly integrated with clinical care, we see tremendous long-term value in having high-speed, low-cost, no-downsampling, and 100% reproducible solutions for data analysis.”

About DNAstack

DNAstack develops a cloud-based platform for genomics data analysis and sharing. Through collaborations with Google, Broad Institute, and the Global Alliance for Genomics & Health, DNAstack provides push-button access to state-of-the-art technologies to help researchers, clinical laboratories, and pharmaceutical companies more quickly and cost-effectively make sense of the world’s exponentially accumulating genomics data and break down barriers to data sharing.

Direct any questions to info@dnastack.com.

About Sentieon

Sentieon develops highly optimized and accurate algorithms for bioinformatics applications, using the team’s expertise in algorithm, software, and system optimization. Sentieon is a team of professionals using accumulated expertise in modeling, optimization, machine learning, and high-performance computing, to enable precision data for precision medicine.

News

Advancing a new era of breakthrough biomedical discoveries

Consortium Secures $5.1M to Expand Genomics Platform for COVID Research

About DNAstack

About Digital Technology Supercluster

About the COVID-19 Program

Digital Technology Supercluster Makes $10 Million Investment, Rounding Out $60 Million COVID-19 Program

Joint Genotyping 10K Whole Genome Sequences Using Sentieon on Google: Strategies for Analyzing Large Sample Sets

The pipeline

The tools

Challenges and optimizations

Upstream Pipeline: FASTQ -> CRAM, GVCF

Downstream Pipeline: Joint Genotyping

Disk Size Requirements

Pipeline Runtime

Solutions

Splitting Up Joint-Genotyping By Region

Merging Joint-Genotyped Files by Chromosome

Sentieon's "Split_By_Sample" Option For Large VCFs

Running VQSR On A Large Dataset

References

Photo Credits

Microscopic image of crystallized DNA from autism genes. Photo Credit: Bianca Guimarae

DNAstack Showcases Standards-Based System For Federated Discovery and Analysis in GA4GH 2020 Connection Demos

About DNAstack

COVID-19 Researchers Get a Boost From AI-Powered Genomics Cloud

Harmonized Variant Calling for SARS-CoV-2 Genomes

SARS-CoV-2 variant detection

Running the workflow

Nanopore variant calling

Illumina paired-end variant calling

Resources

References and Further Reading

DNAstack Launches Genomic Data Explorer to Accelerate Research in COVID-19

About DNAstack

About Digital Technology Supercluster

Media Inquiries

SARS-CoV-2: Biology Origins, and How Open Science is Accelerating the Search for Therapeutic Answers

Introduction

SARS-CoV-2 and related coronaviruses

The SARS-CoV-2 viral life cycle

The Spike protein

Activation of the spike protein following receptor binding

A novel cleavage site on SARS-CoV-2

A key target for therapeutic agents

SARS-CoV-2 Research

About the Author

References and Further Reading

About DNAstack

Photo Credits

Omics Xchange Podcast - COVID-19 Beacon: an Interview with Marc Fiume

Software Tool Built by U of T Startup Shares Genetic Data with COVID-19 Researchers Around the World
