Despite advances in sequencing and analysis tools, calling variants in whole-genome sequencing (WGS) data is not trivial, even when dealing with only a few dozen samples.
When the number of samples reaches into the thousands, the time, computational resources, and file storage required for analysis can quickly become overwhelming. This was the challenge faced by the MSSNG team when they sought to joint-call the largest autism cohort yet sequenced — how could they process nearly 10,000 samples in a way that would be quick, reproducible, and allow for future expansion, all without breaking the bank?
The pipeline
One of the key directives of the initiative was to allow for painless future expansion of the dataset — namely, adding new samples without full reprocessing of the entire cohort. In addition, these outputs should be reproducible and consistent across sequencing technologies and analysis tools, so data from multiple experiments across time, labs, and experimental conditions could be combined and jointly analyzed. To that end, MSSNG researchers chose to analyze their WGS data using standards defined by the Centers for Common Disease Genomics (CCDG). The CCDG provides a set of standardized data processing steps for WGS data with a focus on producing functionally equivalent results (Regier et al., 2018). These steps cover the alignment, duplicate marking, and base quality score recalibration (BQSR) tasks that convert the raw FASTQ data to CRAM-format alignment files that may be used for long-term storage and future reanalysis (Figure 1).
Though not part of the CCDG pipeline itself, the CRAM output from this upstream pipeline is used to call variants (SNPs and small indels) on a per-sample basis, outputting genomic VCF (gVCF) files. Finally, the gVCF files for all samples are combined and joint-called to produce a single VCF file. Optionally, variant quality scores are then recalibrated (variant quality score recalibration, VQSR). See Figure 1 for an overview of the pipeline steps.
The tools
After extensive testing of concordance, cost, and speed, MSSNG chose to use Sentieon to process their WGS samples. Sentieon provides a licensed toolset that implements computationally-optimized versions of common variant-calling tools, providing results up to 10x faster than GATK’s best-practices pipeline while maintaining high concordance with GATK’s results (Freed et al., 2017). Sentieon publishes comprehensive documentation outlining how to run a CCDG-compliant upstream pipeline, as well as information on common downstream analysis steps such as per-sample SNP and indel calling using HaplotypeCaller, and joint genotyping and VQSR using their GVCFtyper and VarCal algorithms.
Challenges and optimizations
Upstream Pipeline: FASTQ -> CRAM, GVCF
In the upstream part of the pipeline, raw FASTQ files are processed to per-sample CRAMs and gVCFs. This segment ran smoothly using Sentieon, taking an average of 4 hours per sample (64 core virtual machine (VM), 55 GB of RAM). In rare cases (~30/9,625 total samples) the alignment step ran out of memory and RAM was increased for these samples. Since many of the Sentieon algorithms are I/O-bound (that is, they are bottlenecked by the speed of reading and writing to the disk, rather than by CPU or memory usage), we also chose to use local SSDs for storage, which provide very fast I/O speeds.
We were able to run the upstream pipeline using preemptible VMs, a machine type that is provided at a much lower cost by Google but which may be shut down at any time if the resources are needed elsewhere. If a VM is shut down in this way, all progress on a task is lost and the task will be automatically restarted on another VM. If a VM is preempted frequently enough, the cost and time lost from running and rerunning the task can outweigh the savings of using a preemptible VM. Out of 8,377 successful runs that we inspected, we found that 7,163 runs were not preempted in any step. The average raw compute cost for these runs was $2.43 USD including storage, CPU, and RAM costs (not including the price of the Sentieon licence itself). We also observed that larger VMs (such as the 64 CPU/55GB RAM VMs used for the Sentieon steps) showed far less preemption events than smaller ones. The upstream pipeline was run in parallel, with ~500 samples run concurrently.
Downstream Pipeline: Joint Genotyping
The majority of complications occurred during the joint genotyping step, which requires merging and joint genotyping all gVCF files generated using the upstream pipeline (1 per sample). Whereas the upstream pipeline can be run massively in parallel with each sample in a separate VM, joint genotyping requires the presence of all of the data in a single VM. This raises two issues: 1) disk size required, and 2) the runtime of the pipeline.
Disk Size Requirements
With each gzipped input gVCF file taking 15–25GB of space, the disk space required for analysis runs into the tens of terabytes for input files alone. The size of the merged output file, which is around the same as the sum of all the input files, must also be considered. While the Google disk size limit of 64 TB per VM should be enough to accommodate this, the size of the output file would make it unwieldy.
Pipeline Runtime
Despite the optimizations implemented by Sentieon, the speed of many processes is limited by the speed of the zip/unzip process (which by default runs on a single core) as both input and output files are gzipped to save space. This reality dramatically slows down the analysis, especially given the size of the files involved.
Solutions
Splitting Up Joint-Genotyping By Region
Sentieon provides a number of built-in solutions that help manage both the size of the final VCF file, as well as the speed of the analysis. First, joint genotyping may be split up to operate independently on different regions of the genome (much like many of GATK’s tools, which allow the analysis to be split up over intervals). This means that 1) the joint genotyping analysis may be run in parallel across intervals, and 2) we do not need to localize the full gVCF file for every sample in every shard — only the region corresponding to the interval we are joint calling in that shard (Figure 2a).
In order to read in only the required regions of the gVCFs without localizing the full files, we took advantage of a feature of htslib which allows bcftools to read directly from Google Cloud Storage locations. bcftools accesses Google credentials using the environment variable GCS_OAUTH_TOKEN, which can be defined as follows (assuming the user has authenticated with Google Cloud):
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
To localize only the desired region of each gVCF file, the following command is used for each sample’s gVCF URL:
bcftools view -R ${region}.bed -Oz -o ${sample}_${region}.g.vcf.gz ${gvcf_url}
Each region.bed should specify a different region of the genome, e.g. chr1:1–50000000. The result of running this command for each gVCF URL is a smaller gVCF file that only includes calls for the region specified in the region.bed file. All of these partial gVCF files are then joint-genotyped together to output a partial VCF that has calls only for the specified region (Figure 2a). Since joint-genotyping is in this way split up into many smaller jobs that can be run in parallel for each region, the process is made considerably faster.
Merging Joint-Genotyped Files by Chromosome
Once the partial VCFs for each region are produced, they must be merged together to form a final, complete VCF file that includes all regions. After some trial and error it was decided that rather than merging all regions of the genome together to form one large final VCF file, genomic regions would be merged on a per-chromosome basis in order to output 26 final VCF files (22 autosomes, chrX, chrY, chrM, and contig regions) (Figure 2a). Although the VCFs for each chromosome are still quite large, they are individually much more manageable than a single VCF containing all regions.
Since both the partial VCFs produced by joint-calling and the merged output file are gzipped, the process of merging these partial VCFs into a single file takes several days to a week to process even a single chromosome — once again, the speed of analysis is bound by the speed of the gzipping process. Had the merge step been run for the full dataset to produce a single output file, we predicted that this step would have taken upwards of a month to complete. By merging only the partial VCFs that made up each chromosome, we were able to run the process in parallel, meaning that the merge step took a little over a week to complete.
Sentieon's " Split_By_Sample" Option For Large VCF'S
It should be noted that Sentieon provides a different method of reducing the size of the final VCF file — they allow the output VCF to be split by sample, rather than by chromosome. This would result in a single ‘main’ file that contains only the first nine columns of the VCF (CHROM, START, STOP, etc.), and then a number of ‘samples’ files that each contain the calls for n samples (n could be 100, 500, 1000, etc. — the smaller the number of samples per file, the smaller the size of each file). Both the ‘main’ file and each of the ‘samples’ files have the data for all chromosomes present, but since each only contains a subset of the samples, each file is quite a bit smaller (Figure 2b). Since none of these files are valid VCF files, an ‘extraction’ step must be performed in order to produce a valid VCF file by combining the first 9 columns from the ‘main’ file along with the desired sample columns from the various ‘samples’ files. Although here we chose to split the final VCF by chromosome rather than by sample to reduce final file size (since we required final VCFs that included every sample), we still took advantage of Sentieon’s ‘split_by_sample’ option because of the implications for VQSR runtime.
Running VQSR On A Large DataSet
It has been mentioned that one of our biggest problems in dealing with a cohort of this size is the bottleneck introduced by the speed of unzipping/zipping large input and output files. We were able to improve the speed of our analysis by processing smaller regions of the genome in parallel rather than trying to read the entire genome sequentially. Coming up to the VQSR step however, we realized that in order to perform this step, which recalibrates variant quality scores across the entire dataset, information from the entire genome must be read in — that is, a single VCF including all chromosomes must be used as input, and a single VCF with all chromosomes would be output. Not only would we lose the benefit of having our VCF files split by chromosome and therefore more manageable in size since we would need the full VCF, this step would again take potentially weeks to read and write these massive gzipped VCF files.
This is where Sentieon’s ‘split_by_sample’ option came in handy. Although we wanted all samples together in the final VCF and so should not have needed this option, by using it to output all sample information in a single ‘samples’ file (containing only the sample IDs and call information, that is, all columns 10-onwards in the VCF), we were also able to produce a ‘main’ file for each chromosome. The ‘samples’ file for each chromosome is quite large since it contains all calls for all samples, however the ‘main’ file is at most a few hundred MBs and contains only columns 1–9 (Figure 2b). This file is all that is needed to perform VQSR, and since it is so much smaller, a process that may have taken weeks to complete could be performed in less than a day.
The ‘main’ file for all chromosomes was combined into a single VCF-like file that contained columns 1–9 for the entire genome; VQSR was performed on this file; finally, the full-genome ‘recalibrated main’ VCF file was split once more by chromosome, to output one ‘recalibrated main’ file per chromosome. These ‘recalibrated main’ files were combined with their respective ‘samples’ files for each chromosome using Sentieon’s extraction script in order to produce a single recalibrated VCF for each chromosome, each of which includes all samples (Figure 2c). Although this extraction process was still bound by the speed of the gzip/gunzip process, it was again able to be performed in parallel across chromosomes to reduce the total runtime needed.
Joint-genotyping the MSSNG cohort was an intensive effort that involved a collaboration between DNAstack, Sentieon, and MSSNG researchers. Sentieon’s CCDG-compliant algorithms allowed for quick, reproducible results that will support straightforward expansion in the future. The speed of the gzip process, as well as the size of output files, required creative solutions to common problems in order to improve speed and workflow costs; we look forward to further improvements in optimizing these processes for large cohorts to help accelerate research.
References
- Freed, D., Aldana, R., Weber, J.A. and Edwards, J.S. 2017. The Sentieon Genomics Tools — A fast and accurate solution to variant calling from next-generation sequence data. bioRxiv. doi: https://doi.org/10.1101/115717
- Regier, A.A., Farjoun, Y., Larson, D.E., Krasheninina, O., Kang, H.M., Howrigan, D.P., Chen, B-J., Kher, M., Banks, E., Ames, D.C., English, A.C., Li, H., Xing, J., Zhang, Y., Matise, T., Abecasis, G.R., Salerno, W., Zody, M.C., Neale, B.M. & Hall, I.M. 2018. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 9: 4038. doi: https://doi.org/10.1038/s41467-018-06159-4
- htslib: https://github.com/samtools/htslib
- bcftools: https://github.com/samtools/bcftools