A Survey of NGS Data Analysis on Hadoop
Chung-Tsai Su
SPN Architect, Core Tech
Trend Micro
2014/10/31 @CSIE.NTU
Introduction to NGS Data Analysis on Hadoop
10/31/2014 Confidential | Copyright 2012 Trend Micro Inc.
Q&A
http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg
NGS Pipeline
High-Level Workflow of NGS
Raw Reads (.fq) -> Read Mapping -> Sequence Alignment/Mapping (.sam/.bam) -> Variant Calling -> Variant Calling file (.vcf)
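The workflow above can be sketched as a chain of stages, each consuming one file type and producing the next. This is only an illustration of the data flow on the slide: the `Stage` type and the `trace` helper are hypothetical names, not part of any NGS tool, and `.bam` stands in for the `.sam/.bam` step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One processing step and the file types it reads and writes (illustrative only)."""
    name: str
    input_ext: str
    output_ext: str


# Stage names and extensions follow the slide; everything else is an assumption.
PIPELINE = [
    Stage("read mapping", ".fq", ".bam"),
    Stage("variant calling", ".bam", ".vcf"),
]


def trace(sample, stages=PIPELINE):
    """Return the file names a sample passes through, checking that stages chain."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.output_ext != nxt.input_ext:
            raise ValueError(f"{prev.name} output does not feed {nxt.name}")
    files = [sample + stages[0].input_ext]
    for stage in stages:
        files.append(sample + stage.output_ext)
    return files


print(trace("sample1"))  # → ['sample1.fq', 'sample1.bam', 'sample1.vcf']
```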
NGS Data Analysis Pipeline
• GATK best practice
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
Illumina Solution
http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf
The First $1,000 Genome – Illumina HiSeq X Ten
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
Expected Data Processing Requirements for Illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
– Has a run cycle of ~3 days and produces ~150 genomes each run cycle
– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697 v2 CPUs, 12-core, 2.7 GHz, 96 GB DRAM) takes ~24 hours per genome.
– To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.
• Target: complete the mapping, aligning, sorting, de-duplication and variant calling of each genome in ~28 minutes.
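The figures above follow from simple arithmetic, which can be checked with a short sketch. The constants are taken from the bullets; the variable names are illustrative. The computed per-genome budget comes out at 28.8 minutes, which matches the slide's rounded figure of ~28 minutes, and the yearly total of 18,250 is in line with "up to 18,000".

```python
import math

GENOMES_PER_CYCLE = 150   # genomes produced per run cycle
CYCLE_DAYS = 3            # length of one run cycle
HOURS_PER_GENOME = 24     # BWA+GATK on one server, per the slide

cycle_hours = CYCLE_DAYS * 24

# Server-hours of work per cycle divided by the hours available in a cycle.
servers_needed = math.ceil(GENOMES_PER_CYCLE * HOURS_PER_GENOME / cycle_hours)

# Time budget per genome if genomes are processed one after another.
minutes_per_genome = cycle_hours * 60 / GENOMES_PER_CYCLE

# Yearly output if cycles run back to back.
genomes_per_year = GENOMES_PER_CYCLE * 365 / CYCLE_DAYS

print(servers_needed)       # → 50
print(minutes_per_genome)   # → 28.8
print(genomes_per_year)     # → 18250.0
```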
http://www.edicogenome.com/dragen/
Literature Survey
Literature
• CloudBurst, 2009
• CloudAligner, 2011
• DistMap, 2013
Algorithm of CloudBurst
Seed-and-Extend Algorithm
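The slide names the technique only, so the following is a minimal single-machine sketch of the general seed-and-extend idea, not CloudBurst's actual implementation. It relies on the pigeonhole principle: if a read aligns with at most k mismatches and is split into k+1 non-overlapping seeds, at least one seed must match exactly. The function names are hypothetical, the read is assumed to be longer than k, and how the work is distributed across Hadoop is not shown.

```python
def split_seeds(read, k):
    """Split read into k+1 non-overlapping seeds; returns (offset, seed) pairs."""
    n = k + 1
    base, extra = divmod(len(read), n)
    seeds, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        seeds.append((pos, read[pos:pos + size]))
        pos += size
    return seeds


def seed_and_extend(reference, read, k):
    """Return sorted start positions where read aligns with at most k mismatches."""
    hits = set()
    for offset, seed in split_seeds(read, k):
        # Seed phase: find every exact occurrence of the seed.
        start = reference.find(seed)
        while start != -1:
            cand = start - offset  # implied start of the whole read
            if 0 <= cand <= len(reference) - len(read):
                # Extend phase: compare the full read against the reference window.
                window = reference[cand:cand + len(read)]
                mismatches = sum(a != b for a, b in zip(window, read))
                if mismatches <= k:
                    hits.add(cand)
            start = reference.find(seed, start + 1)
    return sorted(hits)


print(seed_and_extend("ACGTACGTTAGC", "ACGTTA", 1))  # → [4]
```

A linear scan with `str.find` keeps the sketch short; a real aligner would index the seeds instead of rescanning the reference for each one.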
Performance of CloudBurst
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16000); legend: 0, 1, 2, 3, 4. From the "Experiments: Scalability" slide, EECS 584, Fall 2013.]
Speedup over Serial RMAP
[Figure: Speedup over serial RMAP. x-axis: Number of Mismatches (0 to 4); y-axis: Speedup (0 to 40); series: chr1, chr22. From the "Experiments: Speedup over serial RMAP" slide, EECS 584, Fall 2013.]
Speedup on EC2
[Figure: Running Time on EC2 High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1800). From the "Experiments: Speedup on EC2" slide, EECS 584, Fall 2013.]
Overhead of Disk I/O
Architecture of CloudAligner
Seed-and-Extend Algorithm
Performance on Small Data
Performance on Large Data
Performance on Amazon EMR
Comparison with CloudBurst and CloudAligner
Workflow of DistMap
Evaluation of Read Mapping tools
Comparison of DistMap and other tools for distributed mapping
Market Movement
Hardware Solution -
The World’s First NGS Bioinformatics Processor
http://www.bina.com/product.html
Architecture of bina Technology
http://www.bina.com/technology.html
https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png
Summary
• NGS opens a new chapter in the Big Data era
• More CS experts are needed to solve its scalability and performance issues
• More data scientists are also needed to uncover the insights hidden in the human genome
http://technews.tw/2014/08/02/gene-big-data/
Q&A