Scaling-up NGS secondary analysis using field programmable gate arrays in the cloud


Next-generation sequencing (NGS) holds the promise of tailoring

diagnosis and treatment to individuals based on their genetic

makeup. As the price of sequencing drops, it becomes feasible to

use NGS routinely for clinical applications. One of the obstacles to the

widespread adoption of NGS is the time and cost required to process the

large amounts of data generated by a sequencing instrument. Whole

genome sequencing (WGS) produces hundreds of millions of individual

sequence reads. Making sense of the data requires secondary NGS analysis,

which entails mapping and aligning the reads to a reference genome

and identifying nucleotide variants that differ between the sample and

reference. Because humans are 99.9% identical at the genome level, it is

the 0.1% variation that explains many of our individual traits, including

our susceptibility to diseases.

by Adam J Schindler, Ph.D., Lifeng Tian, Ph.D., Hakon Hakonarson, MD, Ph.D., and Rami Mehio, M.Eng.


Due to the massive amount of sequence data,

secondary analysis is a compute-intensive

process, requiring more than 500 CPU core

hours to analyze a single WGS [1]. Researchers

studying the links between genomic variants

and diseases typically sequence hundreds to

thousands of individuals, making such studies

long and expensive. And clinics that wish to

use NGS for personalized medicine must set up

and maintain large-scale computing systems or

contract with outside vendors to perform the

service. To save time and money, many NGS

users opt for exome sequencing or targeted

panels. These approaches provide only a small

picture of a genome and may miss potentially

important variants in regulatory regions.

Several diseases, including schizophrenia,

diabetes, neuropathy, and prostate cancer,

are linked to variants in non-coding regions [2].

The most comprehensive variant data is

obtained from the complete genome, and

current large-scale sequencing projects have

opted to use WGS [3,4]. The increasing use

of WGS requires faster secondary analysis

solutions to keep up with demand.

This article will address ways to reduce the

time required to perform WGS secondary

analysis and to scale up sample throughput

using field programmable gate arrays (FPGAs)

operating in the cloud. FPGAs are logic circuits

that are more efficient and faster than

conventional CPUs when customized for

genome analysis, while cloud service

providers such as Amazon Web Services

(AWS) have large data centers with the capacity

to process many samples in parallel. We

highlight a recent demonstration in which

secondary analysis was performed on 1000

whole human genome sequences in 2 hours

25 minutes using FPGAs on the AWS Cloud

platform. We also address the costs of NGS

analysis, which are now low enough for WGS

to be used routinely for diagnoses.

FPGAs Rapidly Analyze Genome Sequences

It requires a considerable amount of computing

power to map and align sequence reads to a

reference genome and identify nucleotide

variants in the sample. Analysis of a WGS

by BWA-GATK, one of the most widely used

secondary analysis pipelines, requires over 30

hours when running on a 36-core server [5]. This

type of machine is much more powerful than

a desktop PC, which typically has 2 or 4 cores,

and costs far more. Given the time and expense

required to analyze one genome, it is easy to see

that a clinic wishing to perform WGS analysis

in-house will need a substantial infrastructure

investment.
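To make the terms concrete, the following is a toy sketch of what "mapping and aligning reads" and "identifying nucleotide variants" mean. It is not how BWA-GATK or DRAGEN work internally: it assumes ungapped, brute-force alignment against a 20-base made-up reference, and a simple majority vote for variant calling. The sequences, thresholds, and function names are all illustrative assumptions. Production pipelines use indexed seed-and-extend alignment, gapped alignment, base qualities, and statistical genotyping, which is where the computing cost comes from.

```python
from collections import Counter


def best_alignment(read, reference):
    """Return (position, mismatches) of the ungapped placement with the fewest mismatches."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def call_variants(reads, reference, max_mismatches=1, min_depth=2):
    """Pile up aligned reads and report positions where the majority base differs from the reference."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        pos, mm = best_alignment(read, reference)
        if mm > max_mismatches:
            continue  # treat the read as unmapped
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support * 2 > depth:
            # (0-based position, reference base, sample base, supporting reads, depth)
            variants.append((pos, reference[pos], base, support, depth))
    return variants


# Made-up reference and reads; three reads carry a C-to-T change at position 11.
reference = "ACGTTGCAAGGCTTACGATC"
reads = ["GCAAGGTT", "AAGGTTTA", "GGTTTACG", "ACGTTGCA"]

variants = call_variants(reads, reference)
print(variants)
```

Even this toy version compares every read against every position of the reference. With hundreds of millions of reads and a reference of roughly three billion bases, a real pipeline cannot afford that, which is why specialized algorithms and hardware matter.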

FPGAs are an alternative to CPUs: reconfigurable

hardware that can be customized for

specific applications. The main difference

between CPUs and FPGAs lies in how they

process code: CPUs are sequential processors

that execute instructions one after another, whereas

FPGAs implement an algorithm directly in logic

circuits that work on many of its steps at once. A single FPGA chip

contains thousands of individual logic circuits

that run in parallel, providing further process

acceleration. The speed improvement of FPGAs

over CPUs is striking: The same WGS that takes

about 30 hours to analyze from raw reads to

called variants on a CPU can be processed in

about 22 minutes using an FPGA [6]. San

Diego-based Edico Genome was the first

company to use FPGAs for analysis, and Intel’s

BigStack 2.0 genomics computing system

includes FPGAs to handle some of its

processing.
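The size of the speed improvement quoted above can be worked out directly. The short sketch below uses only the two run times given in the text (about 30 hours on a 36-core server, about 22 minutes on an FPGA). The per-day throughput figures assume a single machine running genomes back to back with no idle time, which is an illustrative simplification rather than a measured result.

```python
# Run times quoted in the article; everything derived from them is illustrative.
CPU_HOURS_PER_GENOME = 30.0      # BWA-GATK on a 36-core server
FPGA_MINUTES_PER_GENOME = 22.0   # FPGA-based analysis


def speedup(cpu_hours: float, fpga_minutes: float) -> float:
    """Ratio of CPU wall-clock time to FPGA wall-clock time."""
    return (cpu_hours * 60.0) / fpga_minutes


def genomes_per_day(minutes_per_genome: float) -> float:
    """Genomes one machine finishes in 24 hours, assuming back-to-back runs."""
    return (24.0 * 60.0) / minutes_per_genome


fpga_speedup = speedup(CPU_HOURS_PER_GENOME, FPGA_MINUTES_PER_GENOME)
cpu_daily = genomes_per_day(CPU_HOURS_PER_GENOME * 60.0)
fpga_daily = genomes_per_day(FPGA_MINUTES_PER_GENOME)

print(f"Speedup: about {fpga_speedup:.0f}x")
print(f"CPU server: {cpu_daily:.1f} genomes per day")
print(f"FPGA: {fpga_daily:.0f} genomes per day")
```

Under these assumptions the FPGA is roughly 80 times faster, so one FPGA system handles in a day what the CPU server would need more than two months to finish.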

Computing Resources for Rent in the Cloud

Setting up an onsite computing system for NGS

analysis requires a significant infrastructure

investment and will incur ongoing maintenance

and operational expenses. An alternative

approach to onsite analysis is to perform the

analysis in the cloud. Cloud service providers

have large data centers with several kinds of

processors (CPUs, FPGAs, GPUs, etc.) that are

available for rent. Many of the biggest players

in technology, including Amazon, Microsoft,

Google, and Alibaba, have developed cloud

computing platforms. The great advantage of

the cloud is that all users—from small clinics

to large research centers—have access to the

same computing resources without a need for

large upfront investments.

Cloud services provide the hardware but not

the software to perform NGS analysis. Users

can set up their own workflow in the cloud

using AWS Marketplace, the storefront for AWS,

or they can use a third-party platform that

offers apps for NGS data analysis and management.

The main companies providing these

services are BaseSpace (operated by Illumina),

DNAnexus, and Seven Bridges. Users select an

app from one of the providers via a web portal

and upload their data, which is processed on

computing resources provisioned from AWS.

In addition to Amazon, Google is also active

in big data genomics, teaming with the Broad

Institute to make GATK available on the Google

Cloud Platform (GCP), and releasing its own

variant caller, DeepVariant, as open-source

code.

How to Analyze 1000 Genomes before Lunch

A few years ago, it took about two or three

days to perform secondary analysis on a WGS,

imposing a bottleneck on the growth of NGS.

Even large computing centers with thousands

of cores could only process a limited number

of genomes per day. The introduction of

FPGAs for genome analysis and the ability

to outsource computing to the cloud have

dramatically transformed the field. To

demonstrate the full capabilities currently

available for secondary analysis, Edico

Genome collaborated with AWS and the

Children’s Hospital of Philadelphia (CHOP)

to analyze 1000 whole human genomes

simultaneously, in a demonstration at the

American Society of Human Genetics annual

meeting in October 2017.


The Center for Applied Genomics at CHOP has

undertaken the PediSeq project to sequence

diverse pediatric genomes with the goal of

identifying genetic variants underlying

childhood diseases. They performed WGS using

Illumina HiSeq machines and provided Edico

Genome with raw read data for 1000 genomes

in FASTQ file format, which were uploaded to

Amazon S3, AWS’s data storage service. The

1000 genomes had an average coverage depth

of 39x and comprised 812 billion sequence

reads totaling 64.3 TB of data. To analyze the

samples, Edico Genome used 1000 f1.2xlarge

Amazon EC2 instances, which are virtual FPGA

servers in the AWS Cloud located in the

us-east-1 AWS region. Edico Genome’s

DRAGEN NGS analysis pipeline was

deployed on each of the F1 instances, and

the analysis was initiated by the DRAGEN

Workflow Management System, which specified

the processes to run, and AWS Batch, a

program for executing batch processes on

instances. To perform the analysis, FASTQ files were retrieved from S3 and loaded onto an instance, where DRAGEN performed mapping, aligning, sorting, duplicate marking, and variant calling, and finished by uploading variant call files (VCFs) back to S3. At its peak, the 1000 F1 instance cluster was processing 143 GB per second. The average time to analyze a genome was 1 hour 51 minutes, and the total time to complete 1000 genomes was 2 hours 25 minutes. Although accuracy was not measured on these data, the same DRAGEN pipeline used in this demonstration has proven to be one of the most accurate variant callers [7], suggesting that speed does not come at the expense of accuracy. In-depth technical details on the procedures and code used to perform the analysis are available in an AWS blog post [8]. When you consider that it took days to analyze a single genome sequence a few years ago, the scale-up in capability to one thousand genomes in a couple of hours is quite remarkable.

The Benefits of Fast and Scalable Analysis

The demonstration by Edico Genome, AWS, and CHOP shows that secondary analysis no longer imposes a bottleneck on NGS and is, in fact, one of the fastest aspects of NGS. The speed of analysis can be an important factor in some cases, such as pregnancy complications or sick newborns, situations in which getting a diagnosis in hours instead of days may save a life. Speed is also important from a cost perspective, since AWS and other cloud providers typically charge based on time. The current on-demand prices for an FPGA instance and a 36-core CPU instance on AWS are similar, so running a process in two hours costs far less than running it in 30 hours.

The throughput achieved in the demonstration is the truly striking feature, as it opens the possibility of massively accelerating NGS. Several ongoing projects seek to generate enormous databases of WGS data, such as Genomics England’s 100,000 Genomes Project [3], the US Precision Medicine Initiative to gather genetic and health information on 1 million people [9], and China’s goal of sequencing 1 million residents of Jiangsu province in two years [10]. The worldwide sequencing capacity was around 1 million genomes per year in 2017, with estimates that this number will increase to 1 billion per year within 10 years [11]. Attaining such extraordinary capacity is only possible with scalable secondary analysis solutions.

Perhaps the biggest push to increase the throughput of sequencing and analysis will come from clinics worldwide. At present, WGS is not routinely used as a diagnostic tool, except to identify some monogenic diseases and cancer drivers. Most diseases that afflict individuals cannot easily be linked to specific sequence variants, limiting the prognostic and diagnostic usefulness of WGS. This is changing, however, as researchers with access to large troves of genomic information and the supercomputers to analyze the data are finding sets of variants that increase disease likelihood. Recent work found genomic variants associated with Parkinson’s disease [12], Alzheimer’s disease [13], type 2 diabetes [14], breast cancer [15], ovarian cancer [16], and obesity [17]. It should be noted that these associations were typically found through targeted sequencing of a subset of known single nucleotide polymorphisms (SNPs). Analyzing WGS data, which include all known SNPs in addition to nucleotide insertions and deletions (INDELs) and copy number variants, will produce a more complete picture of genomic variants associated with specific traits and diseases. It seems plausible that within a few years WGS will be able to provide important information on a wide range of diseases, making it more useful and economical than assays that diagnose only a single disease.


Caption: Representatives from Edico Genome accept the Guinness World Record for the fastest analysis of 1000 whole human genomes at the American Society of Human Genetics annual meeting in October 2017.

Assessing the Cost of Sequencing and Analyzing a WGS

Clinics will only move to NGS when it makes economic sense to do so. By far the largest expense associated with NGS comes from preparing the libraries and performing the sequencing, which is estimated at around $1000 per genome [18]. Prices have dropped in the last few years, and the introduction of Illumina’s high-throughput NovaSeq instrument is likely to reduce costs even more. Illumina has a stated goal of reducing the cost per genome to $100 [19], which underscores the company’s belief that considerable improvement is still possible in sequencing technology.

By comparison with sequencing, bioinformatic secondary analysis is cheap. Competition in this space is intense, which is driving speed and accuracy up and costs down. At present, analyzing a WGS from FASTQ to VCF can be done for $15 through BaseSpace and DNAnexus, and the cost can be even lower when a workflow is set up directly on AWS. There are three pricing models for AWS instances: on-demand, which provides an instance for an open-ended amount of time and is not interrupted by AWS; spot, whose price is set by the market and is usually more than 50% lower than on-demand, but which can be terminated if the market price exceeds the price the user has agreed to pay; and reserved, which requires a 1- or 3-year commitment but locks in low instance prices. Other cloud service providers use pricing models similar to those of AWS. In addition to the low cost, another benefit of the competition in secondary analysis is ease of use. BaseSpace and DNAnexus simplify the process by performing the analysis behind the curtain, requiring users only to upload data and set their parameters. Edico Genome also offers its DRAGEN pipeline as a direct application on AWS that can be run with a few mouse clicks.

The final step in going from biological

sample to useful genomic information is

tertiary analysis—taking the called variants

and interpreting their meaning. This is the

most complicated area of NGS analysis because

new research must continuously be integrated

into the interpretations. Two of the players

in this area are Fabric Genomics and Qiagen,

which both offer VCF analysis tools. We are

still at the early stages of connecting diseases

to specific variants, and it is likely that as more

data emerge and clinics incorporate NGS into

their diagnosis and treatment strategies, we

will see increased competition and downward

price pressure in tertiary analysis solutions.
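Because cloud providers charge by the hour, the effect of run time and pricing model on the cost of secondary analysis is easy to estimate. The sketch below is a rough illustration only. The hourly rate is a placeholder and not an AWS quote; the spot discount is set to 50%, taking the article's "more than 50% lower" as a conservative bound; and the same rate is used for the CPU and FPGA instances because the article describes their on-demand prices as similar. Reserved pricing is left out because it depends on the length of the commitment.

```python
from dataclasses import dataclass

# Assumptions for illustration only; substitute current prices before relying on the output.
HYPOTHETICAL_ON_DEMAND_RATE = 1.50  # USD per instance-hour (placeholder, not an AWS quote)
ASSUMED_SPOT_DISCOUNT = 0.50        # spot assumed to be 50% cheaper than on-demand


@dataclass
class Run:
    label: str
    hours: float
    hourly_rate: float

    @property
    def cost(self) -> float:
        """Cost of one run when billing is proportional to run time."""
        return self.hours * self.hourly_rate


spot_rate = HYPOTHETICAL_ON_DEMAND_RATE * (1.0 - ASSUMED_SPOT_DISCOUNT)

runs = [
    Run("36-core CPU, on-demand", 30.0, HYPOTHETICAL_ON_DEMAND_RATE),
    Run("FPGA, on-demand", 2.0, HYPOTHETICAL_ON_DEMAND_RATE),
    Run("FPGA, spot", 2.0, spot_rate),
]

for run in runs:
    print(f"{run.label:<24} {run.hours:>5.1f} h  ${run.cost:>6.2f}")
```

Whatever the actual rate, the ratio between the first two rows is fixed by the run times: at equal hourly prices, a 2-hour run costs one fifteenth of a 30-hour run. A spot instance lowers the bill further, at the risk of the run being interrupted and having to be restarted.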

Conclusions

The demonstration by CHOP, Edico Genome,

and AWS in analyzing 1000 genomes in under

two and a half hours presents a blueprint for

the future growth of NGS: A collaboration

between a hospital, genomics platform

developer, and cloud computing provider

that takes advantage of the strengths of the

three organizations to improve the speed and

throughput of NGS secondary analysis while

lowering costs. The demonstration has

real-world benefits, as CHOP now has 1000

variant-called sequences to help in the

discovery of genomic links to pediatric

diseases. The next decade will see dramatic

growth of sequencing and increasing reliance

on NGS in the clinic for diagnosis and

treatment strategies. The combination of

fast FPGAs and scalable cloud computing

will enable organizations to keep up with

even the largest sequencing demands.
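The headline figures from the demonstration and the projected growth in sequencing capacity can be put on a per-genome and per-year basis with a few lines of arithmetic. The sketch below uses only numbers reported in this article. It assumes 1 TB = 1000 GB, that instance-hours are simply the sum of per-genome analysis times (idle and start-up time are ignored), and that capacity grows at a constant yearly rate. These are simplifications made for illustration.

```python
# Figures reported in the article.
GENOMES = 1000
TOTAL_READS = 812e9                   # 812 billion sequence reads
TOTAL_DATA_TB = 64.3                  # FASTQ input
AVG_MINUTES_PER_GENOME = 1 * 60 + 51  # 1 hour 51 minutes
WALL_CLOCK_MINUTES = 2 * 60 + 25      # 2 hours 25 minutes
CAPACITY_2017 = 1e6                   # genomes per year worldwide
CAPACITY_PROJECTED = 1e9              # genomes per year, projected
YEARS = 10

# Per-genome workload.
reads_per_genome = TOTAL_READS / GENOMES
data_per_genome_gb = TOTAL_DATA_TB * 1000 / GENOMES  # assumes 1 TB = 1000 GB

# Compute consumed and throughput achieved (ignores idle and start-up time).
instance_hours = GENOMES * AVG_MINUTES_PER_GENOME / 60
genomes_per_hour = GENOMES / (WALL_CLOCK_MINUTES / 60)

# Constant yearly growth factor implied by the capacity projection.
annual_growth = (CAPACITY_PROJECTED / CAPACITY_2017) ** (1 / YEARS)

print(f"Reads per genome: {reads_per_genome / 1e6:.0f} million")
print(f"Input per genome: {data_per_genome_gb:.1f} GB")
print(f"FPGA instance-hours: {instance_hours:.0f}")
print(f"Throughput: {genomes_per_hour:.0f} genomes per hour")
print(f"Implied capacity growth: {annual_growth:.2f}x per year")
```

Read this way, the projection of 1 billion genomes per year within a decade amounts to sequencing capacity roughly doubling every year, which gives a sense of how quickly secondary analysis throughput has to grow to keep pace.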


Adam J Schindler, Ph.D., Technical Writer, Edico Genome

Lifeng Tian, Ph.D., Bioinformatics Scientist, Center for Applied Genomics, Children’s Hospital of Philadelphia

Hakon Hakonarson, MD, Ph.D., Director, Center for Applied Genomics, Children’s Hospital of Philadelphia

Rami Mehio, M.Eng., Vice President, Engineering, Edico Genome

References

1. https://blog.dnanexus.com/2018-01-16-evaluating-the-performance-of-ngs-pipelines-on-noisy-wgs-data/
2. Zhang, F. & Lupski, J.R. 2015. Non-coding genetic variants in human disease. Hum Mol Genet. Oct 15; 24(R1): R102–R110.
3. https://www.genomicsengland.co.uk/
4. https://www.nhlbiwgs.org/
5. https://gatkforums.broadinstitute.org/gatk/discussion/7249/how-long-does-it-take-to-run-the-gatk-best-practices
6. http://edicogenome.com/wp-content/uploads/2014/10/Edico-Genome-Rady-Childrens-White-Paper-March-2017.pdf
7. https://precision.fda.gov/challenges/1/view/results
8. https://aws.amazon.com/blogs/compute/accelerating-precision-medicine-at-scale
9. https://allofus.nih.gov/about/program-faq
10. https://futurism.com/chinese-province-sequencing-1-million-residents-genomes/
11. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
12. https://jamanetwork.com/journals/jamaneurology/article-abstract/2669922
13. https://jamanetwork.com/journals/jamaneurology/article-abstract/2669921
14. https://www.nature.com/articles/ng.3943
15. https://www.nature.com/articles/ng.3785
16. https://www.nature.com/articles/ng.3826
17. https://www.nature.com/articles/nature14177
18. https://www.genome.gov/sequencingcostsdata/
19. https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#f62dbc7386d2
