For media requests, please contact info@dnastack.com.
A consortium of Canadian informatics firms, pharmaceutical companies, and research institutes has pledged C$5.1 million to tailor a bioinformatics and health data platform for COVID-19 research across the country.
A national consortium led by DNAstack will expand development of a software platform for genomics and health data and apply it to COVID-19.
The $5.1M project, called COVID Cloud, is co-funded by Canada’s Digital Technology Supercluster and aims to increase Canada’s capacity to harness exponentially growing volumes of genomics and biomedical data to advance precision health. The platform will be used by data scientists and domain experts to help understand, predict, and treat COVID-19 with molecular precision. With a global death count of over 1.4 million people and record numbers of cases nationally, solutions that can help Canada respond to ongoing challenges of the pandemic are urgently needed.
“We are proud to continue to support this consortium’s groundbreaking work through our COVID-19 program,” said Sue Paish, CEO of the Digital Technology Supercluster. “This project shows how Canadian partnerships across multiple organizations and sectors can drive innovation, help us address global health issues, showcase Canadian expertise, and position us well to rebuild and grow our economy.”
The project — a collaboration between BioSymetrics, Centre of Genomics and Policy at McGill University, DNAstack, FACIT, Genome BC, Mannin Research, McMaster University, Microsoft Canada, Ontario Genomics, Ontario Institute for Cancer Research, Roche Canada, Sunnybrook Research Institute, and Vector Institute — brings together Canadian leaders in software engineering, artificial intelligence, cloud computing, genomics, infectious disease, pharmaceuticals, commercialization, and policy. It leverages past work of partners to address needs of infectious disease research with guidance from domain experts.
“Tools that allow us to interrogate SARS-CoV-2 at a molecular level are essential to addressing this global health crisis, both now and in the future,” said Dr. Samira Mubareka, a microbiologist and infectious diseases physician at Sunnybrook, whose team was one of the first in Canada to isolate the novel coronavirus. “The insights we will learn by analysing integrated datasets using technology platforms like COVID Cloud can increase our preparedness for future waves and outbreaks.” Dr. Mubareka will co-chair the project’s translational science efforts along with Dr. Gabriel Musso, Chief Scientific Officer for BioSymetrics. “The infrastructure developed by this initiative will propel collaborative Canadian drug discovery efforts for COVID-19,” said Musso, whose team will lead bioinformatics and computational drug discovery for the project.
A major goal of the project is to make it easy for producers of genomic and health data to share data responsibly over industry standards, and for researchers to harness the collective power of information shared through them. The project deliverables include a suite of software products powered by enterprise-grade implementations of standards developed by Global Alliance for Genomics & Health (GA4GH), protocols that are being designed to facilitate the responsible sharing of genomic and health data, which will help advance precision medicine initiatives around the world.
“The platform is being built on a foundation of open standards that will allow for distributed networks of genomics and biomedical data to be built,” said Dr. Marc Fiume, CEO at DNAstack, whose team will lead software engineering for the project. “We are excited to see these technologies breaking down barriers to data sharing, access, and analysis and create new opportunities for genomics-based discoveries for our partners.”
This project is responding to global demand for highly specialized, scalable, distributed software infrastructure to support collaborative genomics research — a need that has surged since the onset of the COVID-19 pandemic. “COVID-19 has accelerated digital transformation of many industries, especially in healthcare,” said Kevin Peesker, President of Microsoft Canada. “The incredible power of Cloud applied to COVID at scale is expanding development of an information superhighway to securely connect scientists in Canada and around the world to the data and compute power they urgently need to help us overcome one of the greatest global health crises of our time.”
The platform will be used to support a series of projects in partnership with Canadian academic, clinical, and pharmaceutical collaborators, which are being coordinated by Canadian genome centres, Genome British Columbia and Ontario Genomics. These initial projects are being prioritized based on urgency and potential impact on Canada’s response to the COVID-19 pandemic.
“The COVID Cloud is an incredible platform that brings together resources and capacity to enable timely and comprehensive genomic analysis of SARS-CoV-2 for our province and our country,” said Bettina Hamelin, President and CEO of Ontario Genomics, whose team leads the ONCoV Genomics Coalition. “This made-in-Canada solution will immediately accelerate Canada’s response to COVID-19, while being a technological springboard for translating genomic data analysis into actionable medical insights across other disease areas in years to come.”
For more information, visit here.
DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.
The Digital Technology Supercluster solves some of industry's and society's biggest problems through Canadian-made technologies. We bring together private and public sector organizations of all sizes to address challenges facing Canada's economic sectors including healthcare, natural resources, manufacturing and transportation. Through this 'collaborative innovation' the Supercluster helps to drive solutions better than any single organization could on its own. The Digital Technology Supercluster is led by industry leaders such as D-Wave, Finger Food Advanced Technology Group, LifeLabs, LlamaZOO, Lululemon, MDA, Microsoft, Mosaic Forest Management, Sanctuary AI, Teck Resources Limited, TELUS, Terramera, and 1Qbit. Together, we work to position Canada as a global hub for digital innovation. A full list of Members can be found here.
The COVID-19 Program aims to improve the health and safety of Canadians and support Canada's ability to address issues created by the COVID-19 outbreak. In addition, the program will build expertise and capacity to anticipate and address issues that may arise in future health crises, from healthcare to a return to work and community. More information can be found here.
The Digital Technology Supercluster has made $10.7 million in follow-on investments to five projects under its COVID-19 stream, rounding out the Supercluster’s $60 million budget for the pandemic-focused program.
Read the full article on Betakit.
Despite advances in sequencing and analysis tools, calling variants in whole-genome sequencing (WGS) data is not trivial, even when dealing with only a few dozen samples.
When the number of samples reaches into the thousands, the time, computational resources, and file storage required for analysis can quickly become overwhelming. This was the challenge faced by the MSSNG team when they sought to joint-call the largest autism cohort yet sequenced — how could they process nearly 10,000 samples in a way that would be quick, reproducible, and allow for future expansion, all without breaking the bank?
One of the key directives of the initiative was to allow for painless future expansion of the dataset — namely, adding new samples without full reprocessing of the entire cohort. In addition, these outputs should be reproducible and consistent across sequencing technologies and analysis tools, so data from multiple experiments across time, labs, and experimental conditions could be combined and jointly analyzed. To that end, MSSNG researchers chose to analyze their WGS data using standards defined by the Centers for Common Disease Genomics (CCDG). The CCDG provides a set of standardized data processing steps for WGS data with a focus on producing functionally equivalent results (Regier et al., 2018). These steps cover the alignment, duplicate marking, and base quality score recalibration (BQSR) tasks that convert the raw FASTQ data to CRAM-format alignment files that may be used for long-term storage and future reanalysis (Figure 1).
Though not part of the CCDG pipeline itself, the CRAM output from this upstream pipeline is used to call variants (SNPs and small indels) on a per-sample basis, outputting genomic VCF (gVCF) files. Finally, the gVCF files for all samples are combined and joint-called to produce a single VCF file. Optionally, variant quality scores are then recalibrated (variant quality score recalibration, VQSR). See Figure 1 for an overview of the pipeline steps.
After extensive testing of concordance, cost, and speed, MSSNG chose to use Sentieon to process their WGS samples. Sentieon provides a licensed toolset that implements computationally-optimized versions of common variant-calling tools, providing results up to 10x faster than GATK’s best-practices pipeline while maintaining high concordance with GATK’s results (Freed et al., 2017). Sentieon publishes comprehensive documentation outlining how to run a CCDG-compliant upstream pipeline, as well as information on common downstream analysis steps such as per-sample SNP and indel calling using HaplotypeCaller, and joint genotyping and VQSR using their GVCFtyper and VarCal algorithms.
In the upstream part of the pipeline, raw FASTQ files are processed to per-sample CRAMs and gVCFs. This segment ran smoothly using Sentieon, taking an average of 4 hours per sample (64 core virtual machine (VM), 55 GB of RAM). In rare cases (~30/9,625 total samples) the alignment step ran out of memory and RAM was increased for these samples. Since many of the Sentieon algorithms are I/O-bound (that is, they are bottlenecked by the speed of reading and writing to the disk, rather than by CPU or memory usage), we also chose to use local SSDs for storage, which provide very fast I/O speeds.
We were able to run the upstream pipeline using preemptible VMs, a machine type that is provided at a much lower cost by Google but which may be shut down at any time if the resources are needed elsewhere. If a VM is shut down in this way, all progress on a task is lost and the task will be automatically restarted on another VM. If a VM is preempted frequently enough, the cost and time lost from running and rerunning the task can outweigh the savings of using a preemptible VM. Out of 8,377 successful runs that we inspected, we found that 7,163 runs were not preempted in any step. The average raw compute cost for these runs was $2.43 USD including storage, CPU, and RAM costs (not including the price of the Sentieon licence itself). We also observed that larger VMs (such as the 64 CPU/55 GB RAM VMs used for the Sentieon steps) showed far fewer preemption events than smaller ones. The upstream pipeline was run in parallel, with ~500 samples run concurrently.
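To put the preemption figures above in proportion, the short calculation below derives the share of inspected runs that were never preempted. It uses only the two run counts quoted in the text; the helper name pct is ours and purely illustrative.

```shell
# Run counts quoted in the text above.
total_runs=8377
clean_runs=7163

# Runs that were preempted in at least one step.
preempted_runs=$((total_runs - clean_runs))

# Shell arithmetic is integer-only, so use awk to format the percentages.
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f", 100 * n / d }'; }

echo "never preempted:         ${clean_runs} ($(pct "$clean_runs" "$total_runs")%)"
echo "preempted at least once: ${preempted_runs} ($(pct "$preempted_runs" "$total_runs")%)"
```

In other words, roughly 85.5% of the inspected runs completed without any preemption, and about 14.5% (1,214 runs) were preempted at least once.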
The majority of complications occurred during the joint genotyping step, which requires merging and joint genotyping all gVCF files generated using the upstream pipeline (1 per sample). Whereas the upstream pipeline can be run massively in parallel with each sample in a separate VM, joint genotyping requires the presence of all of the data in a single VM. This raises two issues: 1) disk size required, and 2) the runtime of the pipeline.
With each gzipped input gVCF file taking 15–25GB of space, the disk space required for analysis runs into the tens of terabytes for input files alone. The size of the merged output file, which is around the same as the sum of all the input files, must also be considered. While the Google disk size limit of 64 TB per VM should be enough to accommodate this, the size of the output file would make it unwieldy.
Despite the optimizations implemented by Sentieon, the speed of many processes is limited by the speed of the zip/unzip process (which by default runs on a single core) as both input and output files are gzipped to save space. This reality dramatically slows down the analysis, especially given the size of the files involved.
Sentieon provides a number of built-in solutions that help manage both the size of the final VCF file, as well as the speed of the analysis. First, joint genotyping may be split up to operate independently on different regions of the genome (much like many of GATK’s tools, which allow the analysis to be split up over intervals). This means that 1) the joint genotyping analysis may be run in parallel across intervals, and 2) we do not need to localize the full gVCF file for every sample in every shard — only the region corresponding to the interval we are joint calling in that shard (Figure 2a).
In order to read in only the required regions of the gVCFs without localizing the full files, we took advantage of a feature of htslib which allows bcftools to read directly from Google Cloud Storage locations. bcftools accesses Google credentials using the environment variable GCS_OAUTH_TOKEN, which can be defined as follows (assuming the user has authenticated with Google Cloud):
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
To localize only the desired region of each gVCF file, the following command is used for each sample’s gVCF URL:
bcftools view -R ${region}.bed -Oz -o ${sample}_${region}.g.vcf.gz ${gvcf_url}
Each region.bed should specify a different region of the genome. For example, the region chr1:1–50000000 corresponds to the tab-separated BED record chr1, 0, 50000000, since BED coordinates are 0-based and half-open. The result of running this command for each gVCF URL is a smaller gVCF file that only includes calls for the region specified in the region.bed file. All of these partial gVCF files are then joint-genotyped together to output a partial VCF that has calls only for the specified region (Figure 2a). Since joint-genotyping is in this way split up into many smaller jobs that can be run in parallel for each region, the process is made considerably faster.
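The region files and the per-sample localization loop could be scripted along the following lines. This is a minimal sketch rather than the code used for the project: the function names (make_windows, write_region_beds, localize_region), the fixed-size windowing scheme, the BED file naming, and the way a sample name is derived from the URL are all assumptions made for illustration. Only the bcftools command itself is taken from the text above.

```shell
# Print BED records (tab-separated, 0-based, half-open) that tile one
# chromosome in fixed-size windows.
# Usage: make_windows <chrom> <chrom_length> <window_size>
make_windows() {
  awk -v chrom="$1" -v len="$2" -v size="$3" 'BEGIN {
    for (start = 0; start < len; start += size) {
      end = start + size
      if (end > len) end = len          # last window may be shorter
      printf "%s\t%d\t%d\n", chrom, start, end
    }
  }'
}

# Write one BED file per window, named <chrom>_<start>_<end>.bed
# (a hypothetical naming scheme).
write_region_beds() {
  make_windows "$1" "$2" "$3" |
  while IFS="$(printf '\t')" read -r chrom start end; do
    printf '%s\t%s\t%s\n' "$chrom" "$start" "$end" > "${chrom}_${start}_${end}.bed"
  done
}

# Run the bcftools command from the text for every gVCF URL listed (one per
# line) in a file, for a single region. Assumes bcftools can read from Google
# Cloud Storage, GCS_OAUTH_TOKEN is set as described above, and the remote
# gVCFs are indexed (region queries with -R need an index).
# Usage: localize_region <region_name> <file_of_gvcf_urls>
localize_region() {
  region="$1"
  while read -r gvcf_url; do
    # Assumption: the sample name is the file name without ".g.vcf.gz".
    sample="$(basename "$gvcf_url" .g.vcf.gz)"
    bcftools view -R "${region}.bed" -Oz -o "${sample}_${region}.g.vcf.gz" "$gvcf_url"
  done < "$2"
}
```

For instance, make_windows chrT 120 50 prints three records for a toy 120-base chromosome: 0 to 50, 50 to 100, and a final shorter window from 100 to 120. Each region can then be handled by an independent shard that calls localize_region with that region's name.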
Once the partial VCFs for each region are produced, they must be merged together to form a final, complete VCF file that includes all regions. After some trial and error it was decided that rather than merging all regions of the genome together to form one large final VCF file, genomic regions would be merged on a per-chromosome basis in order to output 26 final VCF files (22 autosomes, chrX, chrY, chrM, and contig regions) (Figure 2a). Although the VCFs for each chromosome are still quite large, they are individually much more manageable than a single VCF containing all regions.
Since both the partial VCFs produced by joint-calling and the merged output file are gzipped, the process of merging these partial VCFs into a single file takes several days to a week to process even a single chromosome — once again, the speed of analysis is bound by the speed of the gzipping process. Had the merge step been run for the full dataset to produce a single output file, we predicted that this step would have taken upwards of a month to complete. By merging only the partial VCFs that made up each chromosome, we were able to run the process in parallel, meaning that the merge step took a little over a week to complete.
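The per-chromosome merge described above can be sketched as follows. The text does not say which tool performed the merge, so the use of bcftools concat here is an assumption, as are the function names and the shard naming scheme (<chrom>_<start>_<end>.vcf.gz). The first function only orders the shard files, which matters because concatenation expects them in genomic order.

```shell
# List the partial VCFs for one chromosome in genomic order.
# Assumes hypothetical file names of the form <chrom>_<start>_<end>.vcf.gz.
# Usage: shards_for_chrom <chrom> <directory>
shards_for_chrom() {
  for f in "$2"/"$1"_*_*.vcf.gz; do
    [ -e "$f" ] || continue
    name="${f##*/}"                        # strip the directory
    rest="${name#"$1"_}"                   # strip the "<chrom>_" prefix
    printf '%s\t%s\n' "${rest%%_*}" "$f"   # start coordinate, then path
  done | sort -n -k 1,1 | cut -f 2         # numeric sort on start, keep path
}

# Merge all shards of one chromosome into <chrom>.vcf.gz.
# Assumption: bcftools concat is used; the original merge tool is not stated.
# Usage: merge_chrom <chrom> <directory>
merge_chrom() {
  list="$(mktemp)"
  shards_for_chrom "$1" "$2" > "$list"
  bcftools concat -f "$list" -Oz -o "$1.vcf.gz"
  rm -f "$list"
}
```

Because each chromosome is merged independently, the merge_chrom calls can be dispatched to separate VMs or run as background jobs, which is what allows the total wall-clock time to approach that of the slowest chromosome instead of the sum of all of them.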
It should be noted that Sentieon provides a different method of reducing the size of the final VCF file: the output VCF can be split by sample rather than by chromosome. This results in a single ‘main’ file that contains only the first nine columns of the VCF (CHROM, POS, ID, etc.), and a number of ‘samples’ files that each contain the calls for n samples (n could be 100, 500, 1000, etc.; the fewer samples per file, the smaller each file). Both the ‘main’ file and each of the ‘samples’ files have the data for all chromosomes present, but since each only contains a subset of the samples, each file is quite a bit smaller (Figure 2b). Since none of these files are valid VCF files, an ‘extraction’ step must be performed in order to produce a valid VCF file by combining the first 9 columns from the ‘main’ file with the desired sample columns from the various ‘samples’ files. Although here we chose to split the final VCF by chromosome rather than by sample to reduce final file size (since we required final VCFs that included every sample), we still took advantage of Sentieon’s ‘split_by_sample’ option because of the implications for VQSR runtime.
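The idea behind the ‘main’/‘samples’ split can be illustrated on a toy tab-delimited table using standard Unix tools. This is a conceptual sketch only: it does not reproduce the layout of the files Sentieon writes, and it is not Sentieon's extraction script. The function names are hypothetical.

```shell
# Split a VCF-like, tab-delimited file into its fixed columns (1-9) and its
# per-sample columns (10 onwards), skipping the "##" meta-information lines.
# Usage: split_columns <input> <main_out> <samples_out>
split_columns() {
  grep -v '^##' "$1" | cut -f 1-9 > "$2"   # 'main' part: CHROM through FORMAT
  grep -v '^##' "$1" | cut -f 10- > "$3"   # 'samples' part: one column per sample
}

# Rebuild full rows by pasting selected sample columns back onto the main part.
# The third argument is a cut-style column list within the samples file,
# for example "1,3" for the first and third samples.
# Usage: extract_samples <main> <samples> <column_list>
extract_samples() {
  cut -f "$3" "$2" | paste "$1" -
}
```

The point of the illustration is the size asymmetry: the nine fixed columns stay the same width no matter how many samples are in the cohort, while the per-sample part grows by one column for every sample added.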
As mentioned, one of our biggest problems in dealing with a cohort of this size is the bottleneck introduced by the speed of unzipping and zipping large input and output files. We were able to improve the speed of our analysis by processing smaller regions of the genome in parallel rather than trying to read the entire genome sequentially. Coming up to the VQSR step, however, we realized that this step, which recalibrates variant quality scores across the entire dataset, needs information from the entire genome: a single VCF including all chromosomes must be used as input, and a single VCF with all chromosomes would be output. Because we would need the full VCF, we would lose the benefit of having more manageable per-chromosome files, and the step could again take weeks to read and write these massive gzipped VCF files.
This is where Sentieon’s ‘split_by_sample’ option came in handy. Although we wanted all samples together in the final VCF and so should not have needed this option, we used it to output all sample information for each chromosome in a single ‘samples’ file (containing only the sample IDs and call information, that is, columns 10 onwards in the VCF), which also gave us a ‘main’ file for each chromosome. The ‘samples’ file for each chromosome is quite large since it contains all calls for all samples; however, the ‘main’ file is at most a few hundred MB and contains only columns 1–9 (Figure 2b). This file is all that is needed to perform VQSR, and since it is so much smaller, a process that may have taken weeks to complete could be performed in less than a day.
The ‘main’ files for all chromosomes were combined into a single VCF-like file that contained columns 1–9 for the entire genome; VQSR was performed on this file; finally, the full-genome ‘recalibrated main’ VCF file was split once more by chromosome, to output one ‘recalibrated main’ file per chromosome. These ‘recalibrated main’ files were combined with their respective ‘samples’ files for each chromosome using Sentieon’s extraction script in order to produce a single recalibrated VCF for each chromosome, each of which includes all samples (Figure 2c). Although this extraction process was still bound by the speed of the gzip/gunzip process, it could again be run in parallel across chromosomes to reduce the total runtime needed.
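The combine-then-split bookkeeping around VQSR can be sketched on toy tab-delimited tables as shown below. The recalibration itself is deliberately left out, since it is performed by Sentieon's tools on the combined file. The function names, the single-header-line assumption, and the output naming are all hypothetical.

```shell
# Concatenate per-chromosome 'main' tables into one genome-wide table,
# keeping the header line only once.
# Assumes each input starts with exactly one header line.
# Usage: combine_main <out> <main_chrA> <main_chrB> ...
combine_main() {
  out="$1"; shift
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}

# Split a genome-wide table back into one file per chromosome, repeating the
# header in each. Writes <prefix><chrom>.tsv for every value seen in column 1.
# Usage: split_by_chrom <genome_wide_table> <prefix>
split_by_chrom() {
  awk -F '\t' -v prefix="$2" '
    NR == 1 { header = $0; next }
    !($1 in seen) { seen[$1] = 1; print header > (prefix $1 ".tsv") }
    { print > (prefix $1 ".tsv") }
  ' "$1"
}
```

In the workflow described above, the recalibration would run between these two steps, and each per-chromosome output would then be rejoined with its ‘samples’ file by the extraction script.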
Joint-genotyping the MSSNG cohort was an intensive effort that involved a collaboration between DNAstack, Sentieon, and MSSNG researchers. Sentieon’s CCDG-compliant algorithms allowed for quick, reproducible results that will support straightforward expansion in the future. The speed of the gzip process, as well as the size of output files, required creative solutions to common problems in order to improve speed and workflow costs; we look forward to further improvements in optimizing these processes for large cohorts to help accelerate research.
Today, DNAstack revealed a software system built on standards developed by the Global Alliance for Genomics and Health (GA4GH) to enable federated analysis of genomics and related health data.
The system, demonstrated at the 8th Plenary Meeting of GA4GH earlier this week, features the first integration of multiple emerging standards for data access, discovery, and cloud computing, which are foundational for connecting secure, scalable, and distributed data sharing networks.
Enabling federated data sharing and analysis is important for advancing genomic research, where institutional and regulatory policies can restrict the use of valuable datasets, which would otherwise stay siloed from each other. GA4GH has brought together a global community of collaborators that have worked for years to design domain-specific standards that facilitate the responsible sharing of genomic and related health information.
“We’re excited to debut an integrated system of GA4GH standards to help break through long-standing barriers to federated analysis of genomic data,” said Max Barkley, Senior Software Developer, Technical Lead at DNAstack, and co-lead of GA4GH Federated Analysis Systems Project (FASP). “This marks a major milestone for DNAstack in our mission to accelerate genomics medicine through data sharing.”
The DNAstack system was demonstrated through a real-world analysis of controlled access data hosted on multiple cloud platforms, including from the Autism Speaks MSSNG Project. The system was presented as one of three GA4GH 2020 Connection Demos. The two other demonstrations showed reproducibility of a bioinformatics analysis run in multiple environments, and multi-directional interoperability by combining implementations from different organizations.
“The Connection Demos are an enormous success for the members of the GA4GH Work Streams, who have collectively dedicated thousands of hours over the last three years toward standards development,” said Ewan Birney, Deputy Director General of the European Molecular Biology Laboratory (EMBL), Director of EMBL’s European Bioinformatics Institute (EMBL-EBI), and Chair of GA4GH. “The demos show how this community’s work will enable interoperability across the genomics endeavour.”
The GA4GH 2020 Connection Demos highlighted how standards can be used vertically and horizontally to share data while complying with institutional, regional, national, and international regulations as well as across cloud and analytics environments. Data sharing across platforms and institutions will enable the research community to access and analyze the tens of millions of genome sequences that have been generated for research and healthcare purposes, which has the potential to rapidly accelerate our scientific understanding, particularly in rare and complex diseases.
DNAstack is helping scientists around the globe better understand COVID-19, so they can develop treatments and vaccines.
To combat the current COVID-19 pandemic, scientists around the world are sequencing viral genomes at an accelerated pace.
These sequences are then being deposited into a number of international databases, including the National Center for Biotechnology Information (NCBI; Figure 1). This approach has limitations: with multiple databases, it is challenging for a single researcher to consolidate data from different sources. The data were also generated and processed by different research groups at different institutions, resulting in batch effects when amalgamated, that is, differences in signal across groups of viral genomes processed together that represent technical noise and not biological variation. The distributed nature of the data and the lack of uniformity in data generation and processing hinder the pace of scientific discovery. To accelerate discovery, we need to leverage the breadth of data available internationally, which requires consolidating data from multiple sources and “cleaning” it to reduce technical artifacts introduced during data processing.
To address this critical gap and accelerate scientific discovery, DNAstack released COVID Cloud, a cloud-based solution that uniquely indexes and integrates data from multiple international sources into a unified data lake. Identifying mutations in the viral genome can help researchers design novel therapeutics and track viral transmissions. COVID Cloud provides easily accessible viral genome data ready for analysis. The data in COVID Cloud can be browsed through apps providing different perspectives including faceted search, point lookup, and 3D visualizations. Users can also export the data into downstream analytical workspaces, such as Jupyter Notebooks, Power BI, or DNAstack’s Workflow Execution Service.
Figure 1
To facilitate high-quality, reproducible variant calling at scale, we have developed and published an open source workflow written in Workflow Description Language (WDL) (available on Dockstore and GitHub). The workflow has two sub-workflows: one to handle long-read (e.g. Nanopore) and one to handle short-read, paired-end (e.g. Illumina) sequencing data (see Nanopore variant calling and Illumina paired-end variant calling sections for more details). These workflows are designed to take NCBI run accessions as input (used to access raw FASTQ files) and return high-confidence variant calls (as VCF files) as well as a consensus genome sequence.
We deployed our WDL variant calling workflows to identify mutations in 10,838 amplicon-based viral sequences hosted on NCBI (10,664 unique samples). Of these, 4,427 are Illumina paired-end short read sequences and 4,471 are Nanopore long read sequences. The resulting variant calls, as well as links to per-sample VCFs and assemblies, have been made freely available for exploration and download at COVID Cloud.
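As an illustration of working with a downloaded per-sample VCF, here is a minimal Python sketch that lists the variants in VCF text. It relies only on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT); the function name parse_vcf and the example record are illustrative assumptions, not output from the workflow.

```python
def parse_vcf(text):
    """Return (chrom, pos, ref, alt) tuples from VCF-formatted text.

    Header lines (starting with '#') are skipped, and multi-allelic
    records are split into one tuple per alternate allele.
    """
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            variants.append((chrom, int(pos), ref, allele))
    return variants


# Illustrative record only; not taken from a real COVID Cloud sample.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "MN908947.3\t23403\t.\tA\tG\t.\tPASS\t.\n"
)

for variant in parse_vcf(EXAMPLE_VCF):
    print(variant)
```

For large files or compressed VCFs, a dedicated library would be a better fit than this hand-rolled parser.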
To further promote scalable and reproducible science, we have published both the Illumina and Nanopore WDL workflows. Below is a brief tutorial on how to call variants locally using the DNAstack variant calling workflow.
The COVID-19 variant calling workflow is available on DNAstack’s GitHub. The workflow may also be viewed on Dockstore. The tools and pipelines used in the workflow have been packaged into publicly available Docker images, allowing the workflow to be run reproducibly across compute environments. Instructions for running the pipeline locally in addition to test input files can be found in the workflow documentation. Briefly, to run the workflow locally, simply edit the input_template.json file found in the inputs directory of the GitHub repository to specify the parameters for the sample of interest, then run it with a workflow runner of your choice, e.g.:
Using Cromwell:
java -jar cromwell.jar run main.wdl -i input_template.json
Using miniwdl:
miniwdl run main.wdl -i input_template.json
The required parameters are the run accession from NCBI (explore SARS-CoV-2 run data); the library type (NANOPORE or ILLUMINA_PE) to determine which pipeline to run; and a file and version number indicating the primer scheme that was used to prepare the library.
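As a rough sketch of how these parameters might be assembled programmatically, the Python snippet below builds an inputs file. The JSON key names (main.accession and so on), the accession, and the primer scheme values are hypothetical placeholders; the real keys are those in input_template.json in the repository.

```python
import json

VALID_LIBRARY_TYPES = {"NANOPORE", "ILLUMINA_PE"}


def build_inputs(accession, library_type, primer_scheme_file, primer_scheme_version):
    """Assemble a workflow inputs dictionary.

    The key names below are hypothetical; check input_template.json
    for the names the workflow actually expects.
    """
    if library_type not in VALID_LIBRARY_TYPES:
        raise ValueError(f"library_type must be one of {sorted(VALID_LIBRARY_TYPES)}")
    return {
        "main.accession": accession,
        "main.library_type": library_type,
        "main.primer_scheme_file": primer_scheme_file,
        "main.primer_scheme_version": primer_scheme_version,
    }


# Placeholder values for illustration only.
inputs = build_inputs("SRR0000000", "ILLUMINA_PE", "primer_scheme.bed", "V3")
print(json.dumps(inputs, indent=2))
```

Validating the library type up front catches a typo before the workflow runner is launched.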
Our Nanopore variant calling workflow leverages the ARTIC bioinformatics protocol (Loman et al., 2020), as implemented by the Connor lab. Briefly, reads are filtered and mapped to the SARS-CoV-2 reference genome (MN908947.3) using minimap2, following which amplicon primer sequences are trimmed. Next, medaka uses neural networks to create a consensus sequence (Oxford Nanopore Technologies, 2018). Medaka is also used to call variants, which are then fed into longshot to produce a set of high-confidence variants. Bcftools is used to generate the final consensus assembly sequence.
For more details please see the ARTIC bioinformatics protocol documentation and the GitHub for the nextflow implementation of this protocol we use in our workflow.
Our Illumina paired-end variant calling workflow leverages the SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) protocol, produced by the McArthur lab. Briefly, reads are mapped to the human genome (GRCh38) using BWA-MEM to remove host reads (Li and Durbin, 2009). Next, adapters are trimmed and reads are mapped to the reference genome using BWA-MEM. Primer sequences are removed and variants are called using ivar with default parameters (Grubaugh et al., 2018). Ivar is also used to produce a consensus assembly sequence.
For more details please see the SIGNAL GitHub and documentation.
We are immensely grateful to the labs and individuals responsible for creating and open-sourcing the pipelines we have chosen to run this COVID analysis.
Our goals were twofold: we wanted to produce robust, reliable data that could be freely distributed to researchers in the hope that such a large volume of data could help provide novel insight into the virus, and we wanted to provide an easy-to-use pipeline for others aiming to process their own data or to reproduce our analysis on the NCBI data.
We will continue to iterate and improve upon our analysis pipelines, ingesting more data every day as it becomes available. In the near future, we plan to add an Illumina single-end pipeline, as well as a method to process metagenomic samples. We also plan to identify and ingest from more databases that expose raw sequencing data and metadata; while assembled genomes are valuable, per-site confidence can only be established when raw reads are available. Since our workflows are written in WDL and their environments are containerized, they can be reproducibly run in virtually any compute environment, ensuring accurate results in an infrastructure-independent way.
We believe that making science open, from the raw data, to analysis methods, to sharing results is essential not only to combat fast-moving diseases such as COVID, but also generally to increase the momentum of research across domains. It is our goal to continue to package and share best-practices workflows, analytics, data and results, so that researchers around the world can more rapidly leverage these resources that may otherwise not be available to them.
DNAstack today announced COVID Cloud, an online destination for exploring one of the largest collections of viral genome sequences in the world.
As the global caseload of COVID-19 exceeds 16.5 million people in over 200 countries, scientists are racing to study the genetics of the virus that causes it, SARS-CoV-2, to inform the development of urgently needed diagnostics and treatments. COVID Cloud is a software solution created by DNAstack that connects and shares a large and growing number of viral genomes seen around the world combined with visualization and analytical tools for scientists to examine the molecular machinery of the virus as it continues to spread and evolve.
“By sharing genetic data globally, we can mount a sort of digital immune response to help us defend against this and future outbreaks,” said Marc Fiume, CEO of DNAstack. “With COVID Cloud, we can help scientists take the best technologies in genomics, data sharing, cloud computing, and machine learning to the fight against COVID-19.”
COVID Cloud provides unified access to a globally representative repository of viral genomes, which is updated daily with new sequences from international biobanks. To reduce errors that arise when comparing datasets from multiple sources, DNAstack processes raw data using harmonized bioinformatics pipelines. These pipelines have been authored in the platform-agnostic Workflow Description Language and published as open source on Dockstore and GitHub, to promote reproducible science and community collaboration.
Figure: Using Variants, researchers can search the entire catalog of mutations found in SARS-CoV-2 sequences.
Datasets are shared over an integrated set of APIs defined by the Global Alliance for Genomics & Health, providing a standards-compliant platform on which the community can build powerful integrations and applications. For example, all of the files in COVID Cloud are served over the GA4GH Data Repository Service, a vendor-neutral way of representing files, to streamline their use in downstream analytical environments such as Jupyter Notebooks, Microsoft Power BI, and DNAstack’s Workflow Execution Service.
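To make the Data Repository Service idea concrete, here is a small Python sketch of how a client might choose a download location from a DRS object. It assumes the object follows the GA4GH DRS schema, in which each entry of access_methods has a type and, optionally, an access_url containing a url. The object ID, URLs, and the function name pick_access_url are hypothetical.

```python
def pick_access_url(drs_object, preferred_types=("https", "s3", "gs")):
    """Return the first directly usable URL from a DRS object, or None.

    Access methods are tried in the order given by preferred_types.
    Methods that expose only an access_id (which needs a further API
    call to resolve) are skipped in this sketch.
    """
    methods = drs_object.get("access_methods", [])
    for wanted in preferred_types:
        for method in methods:
            if method.get("type") == wanted and "access_url" in method:
                return method["access_url"]["url"]
    return None


# Hypothetical DRS object for illustration; not a real COVID Cloud record.
example_object = {
    "id": "example-object-id",
    "access_methods": [
        {"type": "s3", "access_url": {"url": "s3://example-bucket/sample.vcf.gz"}},
        {"type": "https", "access_url": {"url": "https://example.org/sample.vcf.gz"}},
    ],
}

print(pick_access_url(example_object))
```

Because every file is described the same way, the same selection logic can sit in front of a notebook, a BI tool, or a workflow engine.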
COVID Cloud also gives scientists intuitive controls for interactive exploration of the data. Using the Sequences tool, users can search over the entire catalogue of genomics data and information about the original source, collection date, and geographic location. Beacon lets scientists look up the prevalence of specific genetic mutations, such as D614G, a variant that appears to make SARS-CoV-2 more transmissible. With Molecules, researchers can manipulate three-dimensional representations of proteins encoded by the viral genome, like the Spike protein, in order to understand their physical conformations and predict how genetic mutations and therapeutic interventions may impact their function.
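For readers who prefer to script such lookups, the following Python sketch builds a Beacon v1-style query URL. It assumes that D614G corresponds to the nucleotide change A23403G on the reference genome MN908947.3, and that the Beacon uses the v1 query parameters referenceName, start (0-based), referenceBases, alternateBases, and assemblyId. The endpoint URL and the assembly identifier are hypothetical placeholders, not the actual COVID Cloud API.

```python
from urllib.parse import urlencode


def build_beacon_query_url(base_url, reference_name, position_1_based,
                           reference_bases, alternate_bases, assembly_id):
    """Build a Beacon v1-style allele query URL.

    Beacon v1 positions are 0-based, so the familiar 1-based genome
    coordinate is shifted down by one.
    """
    params = {
        "referenceName": reference_name,
        "start": position_1_based - 1,
        "referenceBases": reference_bases,
        "alternateBases": alternate_bases,
        "assemblyId": assembly_id,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and assembly ID, for illustration only.
url = build_beacon_query_url(
    "https://beacon.example.org/query",
    "MN908947.3", 23403, "A", "G", "hypothetical-assembly-id",
)
print(url)
```

The response to such a query would typically report whether the allele has been observed; the exact response fields depend on the Beacon version deployed.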
COVID Cloud is hosted by DNAstack as a free service deployed on Microsoft Azure. The software that powers COVID Cloud is available to license for sharing public or private collections of genomics and clinical data related to COVID-19 or other disease areas.
The development of COVID Cloud has been supported through feasibility funding of the Digital Technology Supercluster’s COVID-19 Program, which aims to improve the health and safety of Canadians and support Canada’s ability to address issues created by the COVID-19 outbreak.
DNAstack’s mission is to improve the lives of millions of people by breaking down barriers to data sharing and discovery. DNAstack develops standards and technologies for scientists to more efficiently find, access, and analyze the world’s exponentially growing volumes of genomic and biomedical data. For additional support or partnership interest, please contact us by email to info@dnastack.com.
The Digital Technology Supercluster solves some of industry’s and society’s biggest problems through Canadian-made technologies. We bring together private and public sector organizations of all sizes to address challenges facing Canada’s economic sectors including healthcare, natural resources, manufacturing, and transportation. Through this ‘collaborative innovation,’ the Supercluster helps to drive solutions better than any single organization could on its own. The Digital Technology Supercluster is led by industry leaders such as D-Wave, LifeLabs, LlamaZOO, Lululemon, MDA, Microsoft, Mosaic Forest Management, Sanctuary AI, Teck Resources Limited, TELUS, Terramera, and 1Qbit. Together, we work to position Canada as a global hub for digital innovation. A full list of Members can be found here.
For DNAstack media inquiries: Christine Beyaert, christine@dnastack.com
For Digital Technology Supercluster related media inquiries: Elysa Darling, elysa@switchboardpr.com
DNAstack's bioinformatician Heather Ward breaks down the biology of the novel coronavirus responsible for the COVID-19 outbreak.
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is the novel coronavirus responsible for the COVID-19 outbreak that first emerged in early December 2019 in Wuhan, China. As of March 20, 2020, SARS-CoV-2 has resulted in nearly 250,000 cases worldwide, claiming the lives of over 10,000 people.
Here, I’ll briefly break down the potential origins and viral life cycle of SARS-CoV-2, how it differs from the virus responsible for the 2002 outbreak, and how genomics and open science can be used to explore and develop therapeutics that will help mitigate this global threat.
SARS-CoV-2 is a coronavirus, a member of a group of positive-sense single-stranded RNA (ssRNA) viruses so named due to their resemblance to solar coronas. Other ssRNA viruses cause diseases that range in severity; examples include HIV, West Nile virus, and the viruses behind the common cold.
There are several coronaviruses known to infect humans, with the most well-known being SARS-CoV (responsible for the 2002 outbreak) and MERS-CoV (Middle East Respiratory Syndrome Coronavirus). Both of these coronaviruses, as well as the current SARS-CoV-2, are believed to have originated in bats, which act as a natural reservoir for a number of coronaviruses. The virus is postulated to pass to humans via an intermediary host (civet cats in the case of SARS-CoV, and dromedary camels for MERS-CoV). Several potential hosts have been suggested as the intermediary for the current SARS-CoV-2, including snakes and pangolins.
It’s important to note that the majority of these bat-endemic coronaviruses are not able to infect humans, and mutation is required for a coronavirus to be able to transition to a new host organism. To obtain insight into which parts of the genome require mutation to allow a virus the ability to target a new host first requires an understanding of the basics of the coronavirus viral life cycle.
Figure 1: SARS-CoV-2 virion.
The major steps of the viral life cycle of SARS-CoV-2 as well as other coronaviruses include:
Mature virions (packaged viral particles including the viral genome and structural proteins, see the SARS-CoV-2 virion pictured in figure 1) released from an infected host cell may infect other cells and continue the infection cycle.
If virions are unable to bind to host cell receptors or if membrane fusion does not occur, infection will not take place. These key steps are both mediated by a particular viral protein — the spike protein.
The spike protein is a homotrimeric (made up of three identical peptides) transmembrane protein found studded around the exterior of the mature virion. Each monomer (one of the three identical peptides) is comprised of two subunits: the S1 subunit, which is responsible for recognizing and binding to a host cell receptor, and the S2 subunit, which facilitates membrane fusion and release of the viral genome into the host cell (see figure 2).
Because the virus can only infect host cells that it is able to bind to, the S1 subunit of the spike protein is responsible for host specificity — the range of hosts that the virus is able to infect. In order for a virus to be able to infect a new organism — e.g. in the transition between bat and human hosts — the receptor binding domain of the S1 subunit must gain the ability to bind to a receptor found in that new host. In both SARS-CoV and SARS-CoV-2, the human receptor appears to be the protein angiotensin converting enzyme 2 (ACE2), which is found on the surface of cells in the human respiratory tract. Interestingly, despite targeting the same receptor protein, many of the key amino acids that interact with the ACE2 receptor and that were previously thought to be essential for binding to ACE2 appear to be almost completely distinct between the SARS-CoV and SARS-CoV-2 receptor binding domains, implying that specificity for the same receptor may have evolved independently in each strain.
Figure 2: Structure of the SARS-CoV spike protein monomer (blue and green) bound to the ACE2 receptor (yellow). The spike protein is comprised of the S1 (blue) and S2 (green) subunits. S1/S2 and S2' cleavage sites are labelled in red. Generated using open-source PyMOL™ from the cryo-EM structure.
Receptor binding alone is not sufficient for viral infection. Binding initiates conformational changes in the spike protein that lead to membrane fusion and infection, but another step is required before fusion can take place: cleavage of the spike protein.
There are at least two cleavage sites on the spike protein that must be cut prior to viral entry; one between the S1 and S2 subunits (S1/S2 site) and one internal to the S2 subunit (S2' site) (see figure 2, red). Cleavage at the S1/S2 site primes the protein and leads to cleavage of the S2' site, which is necessary for membrane fusion. The specific proteases (proteins that cut other proteins) that are able to perform the cleavage steps depend on the amino acid sequence that is present at each cleavage site; in many cases, several different proteases are able to cut the same site with greater or lesser efficiency.
Similar to the host-specificity of the receptor binding domain, if cleavage sites are not recognized by host proteases, cleavage and therefore infection will not be able to occur in that host. This means that both a receptor binding domain that recognizes a host target as well as cleavage sites that can be cut by host proteases are required for transmission of the virus to a novel host. For example, some bat coronaviruses have been found that are able to bind to human proteins but fail to initiate infection because their spike protein is not cleaved in human hosts.
In SARS-CoV-2, a novel cleavage site has been discovered at the S1/S2 junction which is cleaved by a ubiquitous human protease known as furin. The inclusion of this novel furin site allows the SARS-CoV-2 spike protein to be cleaved during biosynthesis — this means that the protein is ‘primed’ even prior to release of the virion from the host cell. This is in contrast to the spike protein produced by SARS-CoV, which lacks this site and is released from the cell intact, requiring later cleavage before it can facilitate membrane fusion.
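To illustrate how a cleavage site can be recognized from sequence alone, here is a minimal Python sketch that scans a peptide for the minimal furin recognition pattern R-X-X-R (arginine, any two residues, arginine). The two peptide fragments are illustrative stand-ins for the S1/S2 regions of each spike protein and should be checked against the reference protein records before any real use; the function name is hypothetical, and a motif match only suggests, and does not demonstrate, cleavage.

```python
import re

# Minimal furin recognition pattern: Arg, any residue, any residue, Arg.
FURIN_MINIMAL_MOTIF = re.compile(r"R..R")


def find_furin_like_sites(peptide):
    """Return (index, matched residues) for each minimal furin motif found."""
    return [(m.start(), m.group()) for m in FURIN_MINIMAL_MOTIF.finditer(peptide)]


# Illustrative fragments around the S1/S2 junction (assumed, not authoritative).
SARS_COV_2_FRAGMENT = "QTQTNSPRRARSVASQ"
SARS_COV_FRAGMENT = "HTVSLLRSTSQK"

print("SARS-CoV-2:", find_furin_like_sites(SARS_COV_2_FRAGMENT))
print("SARS-CoV:  ", find_furin_like_sites(SARS_COV_FRAGMENT))
```

In practice, cleavage efficiency also depends on the residues flanking the motif and on structural accessibility, so dedicated prediction tools and experimental validation are needed.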
It is unclear whether priming during biosynthesis has an impact on viral infectivity; a 2006 study by Follis et al. found that the introduction of a furin cleavage site into SARS-CoV’s spike protein at the S1/S2 junction resulted in enhanced membrane fusion between virus and host, but could find no evidence for an accompanying increase in infectivity. It remains to be seen how the novel furin site in SARS-CoV-2 will impact its infectivity and spread.
Researchers across the globe are searching the SARS-CoV-2 genome for features that will allow it to be targeted by therapeutic agents. Due to the nature of the spike protein and its fundamental role in mediating host specificity and viral infection, it represents an attractive target for the development of therapeutic agents. In particular, mechanisms targeting receptor binding, proteolytic cleavage, and membrane fusion may prove effective in attenuating the virus’s ability to infect human cells. Due to the genetic similarity between the novel SARS-CoV-2 and SARS-CoV, including their shared receptor target, it is possible that agents shown to be effective against SARS-CoV may also prove effective at slowing SARS-CoV-2.
The swift response of researchers worldwide to study SARS-CoV-2 and to share sequencing data publicly has allowed for rapid insights into key genetic features that will prove indispensable in the days and months to come. This tremendous, coordinated global effort to elucidate the origins and mechanisms of the virus could not have been accomplished without the aid of modern technologies allowing researchers to share data quickly across geopolitical borders. This reaffirms the essential role of technology in facilitating science, especially in the ability to respond quickly to global emergencies.
To that end, DNAstack has developed a beacon for SARS-CoV-2 where users can explore aggregated genetic variants discovered by labs worldwide. Explore it here: covid-19.dnastack.com.
Heather is part of the Data Science Team at DNAstack, where she authors, tests, and runs analytical pipelines for internal and customer projects.
Figure 1: CDC/Alissa Eckert, MS; Dan Higgins, MAMS
Figure 2: Song et al., 2018; PDB accession 6ACK.
Canada's Digital Technology Supercluster is supporting Canadians in the fight against COVID-19.
DNAstack today introduced a Beacon for SARS-CoV-2, the virus that causes COVID-19, available at covid-19.dnastack.com.
Toronto company is ‘getting noticed’ as it works to build the digital infrastructure to power the next generation of scientific research.
Here we describe the Beacon protocol and how it can be used as a model for the federated discovery and sharing of genomic data.
The Government of Canada is investing up to $950 million over five years to support industry-led innovation superclusters across the country and accelerate economic growth, productivity, and competitiveness across five Superclusters.
Newly released APIs are the first products from the Global Alliance for Genomics and Health's strategic roadmap for interoperability of genomic data.
ClinGen has joined with the Global Alliance for Genomics and Health (GA4GH) to support the development of open, freely‐available technical standards and regulatory frameworks for secure and responsible sharing of genomic and health‐related data.
Project partners will expand on infrastructure developed by DNAstack for accessing genomic data and explore patient consent models that support nationwide sharing.
The Global Alliance for Genomics and Health (GA4GH) proposes a data access policy model—“registered access”—to increase and improve access to data requiring an agreement to basic terms and conditions, such as the use of DNA sequence and health data in research.
DNAstack today announced its participation in a new project to accelerate the development of a national software platform for precision health in Canada.
The project — in which Deloitte, Genome BC, LifeLabs, Microsoft, Molecular You, Provincial Health Services Authority, and the University of British Columbia will also participate — is among the first to be selected and launched as part of Canada’s Digital Technology Supercluster, a federally funded program that recently received over $150M to stimulate the creation of competitive and innovative digital technology solutions for top industries.
With support from the Canadian government, the team is building a powerful new software platform that will make it easier for healthcare organizations, academic researchers, clinical laboratories, pharmaceutical companies, and other innovators to harness exponentially growing volumes of genomic and biomedical data. The platform will help drive new scientific discoveries and inform medical decisions, translating into more personalized and cost-effective healthcare for millions of Canadians.
The platform has been designed from the ground up around modern principles of data security, sharing, and analysis, and serves as an alternative path for organizations looking to avoid enormous, ongoing cost burdens associated with purchasing and maintaining local computational infrastructure. The platform is already being piloted with early adopters across the country, where it has proven to be dramatically more powerful, secure, cost-efficient, and accessible compared to other existing solutions. The project team aims to deliver the most advanced platform for precision health in the country, positioning healthcare organizations to roll out new programs that reap significant health and economic benefits for years to come.
“We’re laying the foundation for the future of genomic and biomedical science, where the combination of networked data and powerful technology is used to generate life-saving insights faster than ever before,” said Dr. Marc Fiume, CEO and Co-Founder at DNAstack. “With this platform, we’re empowering scientists to take big data, cloud computing, and machine learning to the fight against the biggest challenges in health.”
The platform will provide easy-to-use tools for data producers (e.g. principal investigators, diagnostics laboratories, hospital systems, patient advocacy groups, individuals) to connect and administer the secure sharing of their datasets, and for data consumers (e.g. academic, clinical, pharmaceutical, and industry researchers) to discover and analyze that data using both gold-standard and custom applications. Individual users of the platform will be able to perform intensive statistical and machine learning analyses with on-demand access to hundreds of thousands of compute cores, more than 10 times the computing power of some of the best-equipped research institutions in Canada.
For DNAstack, the project is a continuation of years of global leadership and product innovation in the space. Since 2014, DNAstack has been an active member of the Global Alliance for Genomics & Health (GA4GH), where it contributes to the development of open standards for interoperable data sharing and analysis. This project is integrating key GA4GH protocols for identity, access, discovery, and analysis. In 2018, DNAstack co-founded the Canadian Genomics Cloud, the most computationally powerful public cloud platform for genomics and precision medicine in Canada, which is actively being used by leading scientists across the country to study the genetic causes of autism, adult cancer, pediatric cancer, heart disease, mental health, cystic fibrosis, and other rare diseases. DNAstack is now working in close collaboration with partners of the Digital Technology Supercluster, having diverse and complementary expertise, to introduce entirely new features to the market.
“We are supporting ambitious opportunities that can’t be tackled by one company alone. Through a collective effort, this project aims to make a global impact and position Canada as a world leader in health,” said Bill Tam, Vice President of Business Development and Partner Relations for Canada’s Digital Technology Supercluster. “We are proud that the Supercluster has created an elevated platform for leading Canadian SMEs like DNAstack to continue to innovate and grow.”
The Autism Speaks MSSNG project will help researchers answer the many remaining questions about the genetic underpinnings of autism by sequencing the DNA of over 10,000 families affected by autism.
The volume of genomics and health data is growing rapidly, driven by sequencing for both research and clinical use.
The Global Alliance for Genomics and Health (GA4GH) has announced their Strategic Roadmap, which includes a series of more than two dozen deliverables to be launched in 2018 and developed over the next one to three years.
DNAstack, Canada’s Genomics Enterprise, Google, the Centre of Genomics and Policy, and more announce the launch of the Canadian Genomics Cloud (CGC): a national cloud-based infrastructure for genomics initiatives to share data across Canada.
Not only was Michael Szego the ethics lead on the Personal Genome Project Canada — he was also a participant, agreeing to have his genome mapped and shared publicly.
The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information.