Next-generation sequencing (NGS) holds the promise of tailoring
diagnosis and treatment to individuals based on their genetic
makeup. As the price of sequencing drops, it becomes feasible to
use NGS routinely for clinical applications. One of the obstacles to the
widespread adoption of NGS is the time and cost required to process the
large amounts of data generated by a sequencing instrument. Whole
genome sequencing (WGS) produces hundreds of millions of individual
sequence reads. Making sense of the data requires secondary NGS analysis,
which entails mapping and aligning the reads to a reference genome
and identifying nucleotide variants that differ between the sample and
reference. Because humans are 99.9% identical at the genome level, it is
the 0.1% variation that explains many of our individual traits, including
our susceptibility to diseases.
Scaling-up NGS secondary analysis using field programmable gate arrays in the cloud
by Adam J. Schindler, PhD, Lifeng Tian, PhD, Hakon Hakonarson, MD, PhD, and Rami Mehio, MEng
Due to the massive amount of sequence data,
secondary analysis is a compute-intensive
process, requiring more than 500 CPU core
hours to analyze a single WGS1. Researchers
studying the links between genomic variants
and diseases typically sequence hundreds to
thousands of individuals, making such studies
long and expensive. And clinics that wish to
use NGS for personalized medicine must set up
and maintain large-scale computing systems or
contract with outside vendors to perform the
service. To save time and money, many NGS
users opt for exome sequencing or targeted
panels. These approaches provide only a small
picture of a genome and may miss potentially
important variants in regulatory regions.
Several diseases, including schizophrenia,
diabetes, neuropathy, and prostate cancer
are linked to variants in non-coding regions2.
The most comprehensive variant data is
obtained from the complete genome, and
current large-scale sequencing projects have
opted to use WGS3,4. The increasing use
of WGS requires faster secondary analysis
solutions to keep up with demand.
This article will address ways to reduce the
time required to perform WGS secondary
analysis and to scale up sample throughput
using field programmable gate arrays (FPGAs)
operating in the cloud. FPGAs are logic circuits
that are more efficient and faster than
conventional CPUs when customized for
genome analysis, while cloud service
providers such as Amazon Web Services
(AWS) have large data centers with the capacity
to process many samples in parallel. We
highlight a recent demonstration in which
secondary analysis was performed on 1000
whole human genome sequences in 2 hours
25 minutes using FPGAs on the AWS Cloud
platform. We also address the costs of NGS
analysis, which are now low enough for WGS
to be used routinely for diagnoses.
FPGAs Rapidly Analyze Genome Sequences
It requires a considerable amount of computing
power to map and align sequence reads to a
reference genome and identify nucleotide
variants in the sample. Analysis of a WGS
by BWA-GATK, one of the most widely used
secondary analysis pipelines, requires over 30
hours when running on a 36-core server5. This
type of machine is much more powerful than
a desktop PC, which typically has 2 or 4 cores,
and costs far more. Given the time and expense
required to analyze one genome, it is easy to see
that a clinic wishing to perform WGS analysis
in-house will need a substantial infrastructure
investment.
An alternative to CPUs is the FPGA, reconfigurable
hardware that can be customized for
specific applications. The main difference
between CPUs and FPGAs lies in how they
process code: CPUs are sequential processors
that execute algorithms line-by-line, whereas
FPGAs have logic circuits that process an
entire algorithm at once. A single FPGA chip
contains thousands of individual logic circuits
that run in parallel, providing further processing
acceleration. The speed improvement of FPGAs
over CPUs is striking: The same WGS that takes
about 30 hours to analyze from raw reads to
called variants on a CPU can be processed in
about 22 minutes using an FPGA6. San
Diego-based Edico Genome was the first
company to use FPGAs for NGS analysis, and Intel’s
BigStack 2.0 genomics computing system
includes FPGAs to handle some of its
processing.
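The runtime figures quoted above make the comparison easy to check with back-of-envelope arithmetic. The sketch below (Python) uses only the numbers from the text: roughly 30 hours on a 36-core CPU server versus roughly 22 minutes on an FPGA.

```python
# Back-of-envelope comparison of the WGS secondary-analysis runtimes
# cited in the text: ~30 hours on a 36-core server (BWA-GATK) vs.
# ~22 minutes on an FPGA (refs. 5 and 6).

CPU_HOURS = 30
CPU_CORES = 36
FPGA_MINUTES = 22

cpu_minutes = CPU_HOURS * 60
speedup = cpu_minutes / FPGA_MINUTES            # wall-clock speedup, ~82x
core_hours = CPU_HOURS * CPU_CORES              # ~1080 CPU core-hours per genome

genomes_per_day_cpu = 24 / CPU_HOURS            # <1 genome/day per CPU server
genomes_per_day_fpga = (24 * 60) / FPGA_MINUTES # ~65 genomes/day per FPGA

print(f"speedup: {speedup:.0f}x")
print(f"CPU core-hours per genome: {core_hours}")
print(f"genomes/day: CPU {genomes_per_day_cpu:.1f}, FPGA {genomes_per_day_fpga:.0f}")
```

The roughly 80-fold wall-clock improvement, multiplied across many instances running in parallel, is what makes the large-scale demonstration described later possible.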
Computing Resources for Rent in the Cloud
Setting up an onsite computing system for NGS
analysis requires a significant infrastructure
investment and will incur ongoing mainte-
nance and operational expenses. An alternative
approach to onsite analysis is to perform the
analysis in the cloud. Cloud service providers
have large data centers with several kinds of
processors (CPUs, FPGAs, GPUs, etc.) that are
available for rent. Many of the biggest players
in technology, including Amazon, Microsoft,
Google, and Alibaba, have developed cloud
computing platforms. The great advantage of
the cloud is that all users—from small clinics
to large research centers—have access to the
same computing resources without a need for
large upfront investments.
Cloud services provide the hardware but not
the software to perform NGS analysis. Users
can set up their own workflow in the cloud
using AWS Marketplace, the storefront for AWS,
or they can use a third-party platform that
offers apps for NGS data analysis and manage-
ment. The main platforms providing these
services are BaseSpace (operated by Illumina),
DNAnexus, and Seven Bridges. Users select an
app from one of the providers via a web portal
and upload their data, which is processed on
computing resources provisioned from AWS.
In addition to Amazon, Google is also active
in big data genomics, teaming with the Broad
Institute to make GATK available on the Google
Cloud Platform (GCP), and releasing its own
variant caller, DeepVariant, as open-source
code.
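As a rough illustration of what "setting up your own workflow" in the cloud can look like, the sketch below builds a per-sample AWS Batch job request of the kind used in the demonstration described in the next section. Every resource name here (the queue, job definition, bucket, and command flags) is a hypothetical placeholder, not a real service or product API; in practice each dict would be passed to `boto3.client("batch").submit_job(**job)`.

```python
# Hypothetical sketch: building per-sample AWS Batch job requests for
# cloud NGS secondary analysis. All resource names are placeholders.

def make_batch_job(sample_id: str, bucket: str = "my-ngs-bucket") -> dict:
    """Build the keyword arguments for one FASTQ-to-VCF analysis job
    that reads raw reads from S3 and writes variant calls back to S3."""
    return {
        "jobName": f"secondary-analysis-{sample_id}",
        "jobQueue": "fpga-f1-queue",            # queue backed by FPGA instances
        "jobDefinition": "dragen-pipeline:1",   # container with the pipeline
        "containerOverrides": {
            "command": [
                "--fastq-in", f"s3://{bucket}/fastq/{sample_id}/",
                "--vcf-out",  f"s3://{bucket}/vcf/{sample_id}.vcf.gz",
            ],
        },
    }

# One job per sample; submitting them all at once is what lets the
# cloud scale out to hundreds or thousands of genomes in parallel.
jobs = [make_batch_job(s) for s in ("NA12878", "NA12891", "NA12892")]
print(len(jobs), jobs[0]["jobName"])
```

The key point of the design is that each sample is an independent job, so throughput scales simply by submitting more jobs to a larger pool of instances.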
How to Analyze 1000 Genomes before Lunch
A few years ago, it took about two or three
days to perform secondary analysis on a WGS,
imposing a bottleneck on the growth of NGS.
Even large computing centers with thousands
of cores could only process a limited number
of genomes per day. The introduction of
FPGAs for genome analysis and the ability
to outsource computing to the cloud has
dramatically transformed the field. To
demonstrate the full capabilities currently
available for secondary analysis, Edico
Genome collaborated with AWS and the
Children’s Hospital of Philadelphia (CHOP)
to analyze 1000 whole human genomes
simultaneously, in a demonstration at the
American Society of Human Genetics annual
meeting in October 2017.
The Center for Applied Genomics at CHOP has
undertaken the PediSeq project to sequence
diverse pediatric genomes with the goal of
identifying genetic variants underlying
childhood diseases. They performed WGS using
Illumina HiSeq machines and provided Edico
Genome with raw read data for 1000 genomes
in FASTQ file format, which were uploaded to
Amazon S3, AWS’s data storage service. The
1000 genomes had an average coverage depth
of 39x and comprised 812 billion sequence
reads totaling 64.3 TB of data. To analyze the
samples, Edico Genome used 1000 f1.2xlarge
Amazon EC2 instances, which are virtual FPGA
servers in the AWS Cloud located in the
us-east-1 AWS region. Edico Genome’s
DRAGEN NGS analysis pipeline was
deployed on each of the F1 instances, and
the analysis was initiated by the DRAGEN
Workflow Management System, which specified
the processes to run, and AWS Batch, a
program for executing batch processes on
instances. To perform the analysis, FASTQ files were retrieved from S3 and loaded onto an instance, where DRAGEN performed mapping, aligning, sorting, duplicate marking, and variant calling, and finished by uploading variant call files (VCFs) back to S3. At its peak, the cluster of 1000 F1 instances was processing 143 GB per second. The average time to analyze a genome was 1 hour 51 minutes, and the total time to complete all 1000 genomes was 2 hours 25 minutes. Although accuracy metrics were not generated for these data, the same DRAGEN pipeline used in this demonstration has proven to be one of the most accurate variant callers7, indicating that speed does not come at the expense of accuracy. In-depth technical details on the procedures and code used to perform the analysis are available in an AWS blog post8. When you consider that it took days to analyze a single genome sequence a few years ago, the scale-up in capability to one thousand genomes in a couple of hours is quite remarkable.

Caption: Representatives from Edico Genome accept the Guinness World Record for the fastest analysis of 1000 whole human genomes at the American Society of Human Genetics annual meeting in October 2017.

The Benefits of Fast and Scalable Analysis

The demonstration by Edico Genome, AWS, and CHOP shows that secondary analysis no longer imposes a bottleneck on NGS and is, in fact, one of the fastest aspects of NGS. The speed of analysis can be an important factor in some cases, such as pregnancy complications or sick newborns, situations in which getting a diagnosis in hours instead of days may save a life. Speed is also important from a cost perspective, since AWS and other cloud providers typically charge based on time. The current on-demand prices for an FPGA instance and a 36-core CPU instance on AWS are similar, so running a process in two hours costs far less than running it for 30 hours.

The throughput achieved in the demonstration is the truly striking feature, as it opens the possibility of massively accelerating NGS. Several ongoing projects seek to generate enormous databases of WGS data, such as Genomics England’s 100,000 Genomes Project3, the US Precision Medicine Initiative to gather genetic and health information on 1 million people9, and China’s goal of sequencing 1 million residents of Jiangsu province in two years10. The worldwide sequencing capacity was around 1 million genomes per year in 2017, with estimates that this number will increase to 1 billion per year within 10 years11. Attaining such extraordinary capacity is only possible with scalable secondary analysis solutions.

Perhaps the biggest push to increase the throughput of sequencing and analysis will come from clinics worldwide. At present, WGS is not routinely used as a diagnostic tool, except to identify some monogenic diseases and cancer drivers. Most diseases that afflict individuals cannot easily be linked to specific sequence variants, limiting the prognostic and diagnostic usefulness of WGS. This is changing, however, as researchers with access to large troves of genomic information and the supercomputers to analyze the data are finding sets of variants that increase disease likelihood. Recent work found genomic variants associated with Parkinson’s disease12, Alzheimer’s disease13, type 2 diabetes14, breast cancer15, ovarian cancer16, and obesity17. It should be noted that these associations were typically found through targeted sequencing of a subset of known single nucleotide polymorphisms (SNPs). Analyzing WGS data that include all known SNPs, in addition to nucleotide insertions and deletions (INDELs) and copy number variants, will produce a more complete picture of the genomic variants associated with specific traits and diseases. It seems plausible that within a few years WGS will be able to provide important information on a wide range of diseases, making it more useful and economical than assays that diagnose only a single disease.

Assessing the Cost of Sequencing and Analyzing a WGS

Clinics will only move to NGS when it makes economic sense to do so. By far the largest expense associated with NGS comes from preparing the libraries and performing the sequencing, which is estimated at around $1000 per genome18. Prices have dropped in the last few years, and the introduction of Illumina’s high-throughput NovaSeq instrument is likely to drop costs even more. Illumina has a stated goal of reducing the cost per genome to $10019, which underscores their belief that considerable improvement is still possible in sequencing technology.

By comparison with sequencing, bioinformatic secondary analysis is cheap. Competition in this space is intense, which is driving speed and accuracy up and costs down. At present, analyzing a WGS from FASTQ to VCF can be done for $15 through BaseSpace and DNAnexus, and the cost can be even lower if setting up a workflow directly on AWS. There are three pricing models for AWS instances: on-demand, which rents an instance for an open-ended amount of time and will not be interrupted; spot, whose price is set by the market and is usually more than 50% lower than on-demand, but whose instances can be terminated if the market price exceeds the price the user has bid; and reserved, which requires a 1- or 3-year commitment but locks in low instance prices. Other cloud service providers use similar pricing models to AWS. In addition to the low cost, another positive of the
competition in secondary analysis is the
ease of use. BaseSpace and DNAnexus simplify
the process by performing the analysis behind
the scenes, requiring users only to upload
data and set their parameters. Edico Genome
also offers its DRAGEN pipeline as a direct
application on AWS that can be run with a
few mouse clicks.
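The pricing models described above can be compared with a quick illustrative sketch. The hourly rate below is an assumed on-demand price for an f1.2xlarge-class FPGA instance, used for illustration only; actual prices vary by region and over time, and the 50% spot discount is the floor implied by the text's ">50% lower" figure.

```python
# Illustrative cost comparison of AWS pricing models for a WGS
# secondary-analysis run. The $1.65/hour rate is an assumption for
# illustration, not a quoted price.

ON_DEMAND_PER_HOUR = 1.65     # assumed FPGA-instance on-demand rate
SPOT_DISCOUNT = 0.50          # spot is "usually >50% lower than on-demand"

def run_cost(hours: float, rate: float) -> float:
    """Cost of keeping one instance for the given number of hours."""
    return round(hours * rate, 2)

fpga_on_demand = run_cost(2, ON_DEMAND_PER_HOUR)    # ~2 h FPGA run
cpu_on_demand = run_cost(30, ON_DEMAND_PER_HOUR)    # ~30 h CPU run at a similar rate
fpga_spot = run_cost(2, ON_DEMAND_PER_HOUR * (1 - SPOT_DISCOUNT))

print(f"FPGA on-demand: ${fpga_on_demand:.2f}")     # $3.30
print(f"CPU on-demand:  ${cpu_on_demand:.2f}")      # $49.50
print(f"FPGA spot:      ${fpga_spot:.2f}")          # $1.65, if not interrupted
```

Since on-demand rates for the FPGA and 36-core CPU instances are similar, finishing in two hours instead of 30 is itself the dominant cost saving; spot pricing can roughly halve the bill again for interruptible workloads.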
The final step in going from biological
sample to useful genomic information is
tertiary analysis—taking the called variants
and interpreting their meaning. This is the
most complicated area of NGS analysis because
new research must continuously be integrated
into the interpretations. Two of the players
in this area are Fabric Genomics and Qiagen,
which both offer VCF analysis tools. We are
still at the early stages of connecting diseases
to specific variants, and it is likely that as more
data emerge and clinics incorporate NGS into
their diagnosis and treatment strategies, we
will see increased competition and downward
price pressure in tertiary analysis solutions.
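Tertiary analysis starts from the VCF that secondary analysis produces. As a minimal illustration of the hand-off, the sketch below parses the fixed, tab-delimited fields of one VCF data line; the record shown is invented for illustration and is not real patient data.

```python
# Minimal sketch of reading one data line from a VCF (variant call
# file), the hand-off point between secondary and tertiary analysis.
# The example record below is invented for illustration.

def parse_vcf_line(line: str) -> dict:
    """Split a tab-delimited VCF data line into its fixed fields."""
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),              # 1-based position on the reference
        "id": vid,
        "ref": ref,                   # reference allele
        "alt": alt.split(","),        # alternate allele(s) in the sample
        "qual": float(qual),
        "filter": filt,
        "info": dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv),
    }

record = parse_vcf_line("chr1\t1014143\trs0000000\tC\tT\t48.0\tPASS\tDP=32;AF=0.5")
print(record["chrom"], record["pos"], record["ref"], ">", record["alt"][0])
```

Tertiary analysis tools consume millions of records like this one and annotate each variant with gene, frequency, and disease-association information, which is where the interpretive difficulty lies.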
Conclusions
The demonstration by CHOP, Edico Genome,
and AWS in analyzing 1000 genomes in under
two and a half hours presents a blueprint for
the future growth of NGS: A collaboration
between a hospital, genomics platform
developer, and cloud computing provider
that takes advantage of the strengths of the
three organizations to improve the speed and
throughput of NGS secondary analysis while
lowering costs. The demonstration has
real-world benefits, as CHOP now has 1000
variant-called sequences to help in the
discovery of genomic links to pediatric
diseases. The next decade will see dramatic
growth of sequencing and increasing reliance
on NGS in the clinic for diagnosis and
treatment strategies. The combination of
fast FPGAs and scalable cloud computing
will enable organizations to keep up with
even the largest sequencing demands.
Adam J. Schindler, PhD, Technical Writer, Edico Genome
Lifeng Tian, PhD, Bioinformatics Scientist, Center for Applied Genomics, Children’s Hospital of Philadelphia
Hakon Hakonarson, MD, PhD, Director, Center for Applied Genomics, Children’s Hospital of Philadelphia
Rami Mehio, MEng, Vice President, Engineering, Edico Genome
References
1. https://blog.dnanexus.com/2018-01-16-evaluating-the-performance-of-ngs-pipelines-on-noisy-wgs-data/
2. Zhang, F. & Lupski, J.R. 2015. Non-coding genetic variants in human disease. Hum Mol Genet. Oct 15; 24(R1): R102–R110.
3. https://www.genomicsengland.co.uk/
4. https://www.nhlbiwgs.org/
5. https://gatkforums.broadinstitute.org/gatk/discussion/7249/how-long-does-it-take-to-run-the-gatk-best-practices
6. http://edicogenome.com/wp-content/uploads/2014/10/Edico-Genome-Rady-Childrens-White-Paper-March-2017.pdf
7. https://precision.fda.gov/challenges/1/view/results
8. https://aws.amazon.com/blogs/compute/accelerating-precision-medicine-at-scale
9. https://allofus.nih.gov/about/program-faq
10. https://futurism.com/chinese-province-sequencing-1-million-residents-genomes/
11. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
12. https://jamanetwork.com/journals/jamaneurology/article-abstract/2669922
13. https://jamanetwork.com/journals/jamaneurology/article-abstract/2669921
14. https://www.nature.com/articles/ng.3943
15. https://www.nature.com/articles/ng.3785
16. https://www.nature.com/articles/ng.3826
17. https://www.nature.com/articles/nature14177
18. https://www.genome.gov/sequencingcostsdata/
19. https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#f62dbc7386d2