A Survey of NGS Data Analysis on Hadoop

Chung-Tsai Su, SPN Architect, Core Tech, Trend Micro. 2014/10/31 @CSIE.NTU. Introduction of NGS Data Analysis on Hadoop.

Page 1: A Survey of NGS Data Analysis on Hadoop

Chung-Tsai Su

SPN Architect, Core Tech

Trend Micro

2014/10/31 @CSIE.NTU

Introduction of NGS Data Analysis on Hadoop

10/31/2014 Confidential | Copyright 2012 Trend Micro Inc.

Page 2: A Survey of NGS Data Analysis on Hadoop

Q&A

http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg

Page 3: A Survey of NGS Data Analysis on Hadoop

http://www.genome.gov/sequencingcosts/

NGS Era

Page 4: A Survey of NGS Data Analysis on Hadoop

NGS Pipeline

Page 5: A Survey of NGS Data Analysis on Hadoop

High-Level Workflow of NGS

Raw Reads (.fq) -> Read Mapping -> Sequence Alignment/Mapping (.sam/.bam) -> Variant Calling -> Variant Calling file (.vcf)
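
As an illustration of the input to the first stage: raw reads in a .fq (FASTQ) file are stored as four-line records (header starting with "@", sequence, "+" separator, quality string). The following is a minimal sketch of reading such records. The function name `read_fastq` and the two sample reads are illustrative assumptions, not part of the slides.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Basic sanity checks: header marker and one quality char per base.
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


# Hypothetical two-read input, used only for demonstration.
sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
records = list(read_fastq(sample))
print(records)
```

A streaming generator is used here because real FASTQ files are usually far too large to hold in memory at once.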

Page 6: A Survey of NGS Data Analysis on Hadoop

NGS Data Analysis Pipeline

• GATK best practice

https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq

Page 7: A Survey of NGS Data Analysis on Hadoop

illumina solution

http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf

Page 8: A Survey of NGS Data Analysis on Hadoop

The First $1,000 Genome – illumina HiSeq X Ten

http://systems.illumina.com/systems/hiseq-x-sequencing-system.html

Page 9: A Survey of NGS Data Analysis on Hadoop

Expectation of Data Processing Power for illumina HiSeq X Ten

• A cluster of 10 HiSeq X instruments

• Capable of sequencing up to 18,000 whole human genomes each year

– Has a run cycle of ~3 days and produces ~150 genomes each run cycle

– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697 v2 CPUs, 12-core, 2.7 GHz, with 96 GB DRAM) takes ~24 hours per genome.

– To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.

• A single analysis system would need to meet a target of ~28 minutes (3 days / 150 genomes) for the completion of the mapping, aligning, sorting, de-duplication and variant calling of each genome.

http://www.edicogenome.com/dragen/
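
The figures on this slide follow from simple throughput arithmetic. The sketch below reproduces them from the slide's own numbers (150 genomes per ~3-day run, ~24 hours per genome per server). The function and constant names are illustrative, and a 365-day year of back-to-back runs is an assumption.

```python
import math

GENOMES_PER_RUN = 150   # from the slide
RUN_CYCLE_DAYS = 3      # from the slide
HOURS_PER_GENOME = 24   # BWA+GATK on one server, from the slide


def servers_needed(genomes: int, cycle_days: float, hours_per_genome: float) -> int:
    """Servers required so that one run's genomes are analysed within one run cycle."""
    cycle_hours = cycle_days * 24
    return math.ceil(genomes * hours_per_genome / cycle_hours)


def minutes_per_genome(genomes: int, cycle_days: float) -> float:
    """Time budget per genome if a single system processes genomes one after another."""
    return cycle_days * 24 * 60 / genomes


def genomes_per_year(genomes: int, cycle_days: float, days_per_year: int = 365) -> float:
    """Yearly output assuming back-to-back runs all year (an assumption)."""
    return genomes * days_per_year / cycle_days


servers = servers_needed(GENOMES_PER_RUN, RUN_CYCLE_DAYS, HOURS_PER_GENOME)
budget = minutes_per_genome(GENOMES_PER_RUN, RUN_CYCLE_DAYS)
yearly = genomes_per_year(GENOMES_PER_RUN, RUN_CYCLE_DAYS)
print(servers, budget, yearly)  # 50 servers, 28.8 minutes, 18250.0 genomes
```

The yearly figure of 18,250 is consistent with the "up to 18,000" quoted above once some downtime is allowed for.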

Page 10: A Survey of NGS Data Analysis on Hadoop

Literature Survey

Page 11: A Survey of NGS Data Analysis on Hadoop

Literature

• CloudBurst, 2009

• CloudAligner, 2011

• DistMap, 2013

Page 12: A Survey of NGS Data Analysis on Hadoop

Page 13: A Survey of NGS Data Analysis on Hadoop

Algorithm of CloudBurst

Seed-and-Extend Algorithm
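
The slide's figure is not reproduced in this transcript. As a rough, single-machine sketch of the general seed-and-extend idea (mismatches only, no gaps): a read is cut into (k + 1) non-overlapping seeds, so if the read aligns with at most k mismatches, at least one seed must match the reference exactly. Exact seed hits give candidate positions, which are then verified over the full read. This is not CloudBurst's actual code; all names below are hypothetical. In a MapReduce setting, grouping reference k-mers and read seeds by key would play the role of the index lookup used here.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read: str, reference: str, max_mismatches: int) -> List[Tuple[int, int]]:
    """Return sorted (reference_start, mismatches) with at most max_mismatches."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read too short for the requested number of mismatches")
    index = build_kmer_index(reference, seed_len)
    candidates: Set[int] = set()
    # Seed phase: look up each of the (max_mismatches + 1) non-overlapping seeds.
    for i in range(max_mismatches + 1):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, []):
            start = hit - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Extend phase: count mismatches over the full read at each candidate.
    alignments = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            alignments.append((start, mismatches))
    return alignments


# Hypothetical toy data: the read equals reference[4:12] with one substitution.
reference = "ACGTTAGCCGATAGGCTAAC"
read = "TAGCCGTT"
print(seed_and_extend(read, reference, 1))  # [(4, 1)]
print(seed_and_extend(read, reference, 0))  # []
```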

Page 14: A Survey of NGS Data Analysis on Hadoop

Performance of CloudBurst

[Figure: Running Time vs Number of Reads on Chr 1. X-axis: Millions of Reads (0 to 8); Y-axis: Runtime (s) (0 to 16000); legend: 0, 1, 2, 3, 4. Slide source: EECS 584, Fall 2013, Experiments: Scalability]

Page 15: A Survey of NGS Data Analysis on Hadoop

Speedup over Serial RMAP

[Figure: Speedup over serial RMAP. X-axis: Number of Mismatches (0 to 4); Y-axis: Speedup (0 to 40); series: chr1, chr22. Slide source: EECS 584, Fall 2013, Experiments: Speedup over serial RMAP]
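
For reference, speedup in plots of this kind is conventionally the serial running time divided by the parallel running time, and parallel efficiency is speedup divided by the number of cores. A minimal sketch follows; the timings and core count are hypothetical placeholders, not values read from the figure.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of serial to parallel running time (greater than 1 means faster)."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(serial_seconds: float, parallel_seconds: float, cores: int) -> float:
    """Speedup per core; 1.0 would be ideal linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / cores


# Hypothetical timings for illustration only (not taken from the slides).
serial_time = 14400.0    # seconds for a serial run
parallel_time = 800.0    # seconds on a 24-core cluster
s = speedup(serial_time, parallel_time)
e = parallel_efficiency(serial_time, parallel_time, 24)
print(s, e)  # 18.0 0.75
```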

Page 16: A Survey of NGS Data Analysis on Hadoop

Speedup on EC2

[Figure: Running Time on EC2 High-CPU Medium Instance Cluster. X-axis: Number of Cores (24, 48, 72, 96); Y-axis: Running time (s) (0 to 1800). Slide source: EECS 584, Fall 2013, Experiments: Speedup on EC2]

Page 17: A Survey of NGS Data Analysis on Hadoop

Page 18: A Survey of NGS Data Analysis on Hadoop

Overhead of Disk I/O

Page 19: A Survey of NGS Data Analysis on Hadoop

Architecture of CloudAligner

Seed-and-Extend Algorithm

Page 20: A Survey of NGS Data Analysis on Hadoop

Performance on Small Data

Page 21: A Survey of NGS Data Analysis on Hadoop

Performance on Large Data

Page 22: A Survey of NGS Data Analysis on Hadoop

Performance on Amazon EMR

Page 23: A Survey of NGS Data Analysis on Hadoop

Comparison with CloudBurst and CloudAligner

Page 24: A Survey of NGS Data Analysis on Hadoop

Page 25: A Survey of NGS Data Analysis on Hadoop

Workflow of DistMap

Page 26: A Survey of NGS Data Analysis on Hadoop

Evaluation of Read Mapping tools

Page 27: A Survey of NGS Data Analysis on Hadoop

Comparison of DistMap and other tools for distributed mapping

Page 28: A Survey of NGS Data Analysis on Hadoop

Market Movement

Page 29: A Survey of NGS Data Analysis on Hadoop

Hardware Solution -

The World’s First NGS Bioinformatics Processor

Page 30: A Survey of NGS Data Analysis on Hadoop

http://www.bina.com/product.html

Page 31: A Survey of NGS Data Analysis on Hadoop

Architecture of bina Technology

http://www.bina.com/technology.html

Page 32: A Survey of NGS Data Analysis on Hadoop

https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png

Page 33: A Survey of NGS Data Analysis on Hadoop

Summary

• NGS opens a new chapter in the Big Data era

• More CS experts are needed to solve scalability and performance issues

• More data scientists are also needed to discover the secrets/insights of the human genome

Page 34: A Survey of NGS Data Analysis on Hadoop

http://technews.tw/2014/08/02/gene-big-data/

Page 35: A Survey of NGS Data Analysis on Hadoop

Q&A
