BioCompute & SciDB: a pipeline-in-a-database
© Paradigm4 2
Topics
• Why a pipeline-in-database? Copy NCBI
• SciDB: a scientific data storage and computing platform
• BioCompute Example: Group-based somatic mutation calling
3.2 billion columns, millions of rows, analysis-ready, not in files
Typical Research Workflow
Data Exploration
Pipeline Data Generation
COMPUTE
LOAD
Typical Research Workflow as BioCompute objects
Data Exploration
Pipeline Data Generation
COMPUTE
LOAD
Group-based variants – big pileups
• Workflow
  – merge multiple BAMs
  – sort reads by genomic coordinate
• 8.8 TB (2 x 4.4 TB flash drives) needed for the sort, in addition to the distributed file system
• San Diego Supercomputer Center
What did they find in human genomes?
Optimum was ~15 whole genomes at a time
Known variants (dark blue and dark green)
Novel variants (light blue and light green)
Attribute                 Single file x 100s   Grouped files   Concordance
Total variants            30,790,918           29,915,861      81.4%
Unique variants           2,668,331            3,543,283       MHC, Y, Tele
Minor allele frequency    1-5%                 <1%
Ti/Tv (2.19 is ideal)     1.2                  1.6
SciDB: a scalable scientific, computational DBMS
SciDB blurs the line between storage and computation
In-situ, massively scalable analytics
Scientific data are stored natively as multi-dimensional arrays
[Figure: multi-dimensional array with axes Genomic coordinate, Chromosome, and Patient]
Pipeline data generation
For reproducibility and re-analysis
• Minimize data movement and copying
• Data stored in analysis-ready form
• Metadata stored with data (BioCompute requirement)
• Rapid selection of specific data of interest
[Figure: QueryF (x, y, z), QueryG (x, y, z), and QueryT (x, y, z) issued against SciDB; the example array below is drawn four times]
Example array (rows: SAMPLES DIMENSION; columns: GENES DIMENSION)
         AKT1   EGFR   TP53   ZNF11
TCGA1    3.2    0.3    0.1    11
TCGA2    3.4    0.0    0.8    10
TCGA3    1.1    1.0    1.2    14
TCGA4    0.9    1.0    1.2    13
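The "rapid selection of specific data of interest" idea can be illustrated with the example array from this slide. The following is a minimal sketch in plain Python, not the SciDB API: the `select` helper is hypothetical, and the assignment of rows to samples and columns to genes is an assumption read off the figure labels.

```python
# Toy in-memory stand-in for a 2-D array; this is NOT the SciDB API.
# Assumption: rows are samples and columns are genes, as the figure labels suggest.
SAMPLES = ["TCGA1", "TCGA2", "TCGA3", "TCGA4"]   # samples dimension
GENES = ["AKT1", "EGFR", "TP53", "ZNF11"]        # genes dimension
VALUES = [
    [3.2, 0.3, 0.1, 11],
    [3.4, 0.0, 0.8, 10],
    [1.1, 1.0, 1.2, 14],
    [0.9, 1.0, 1.2, 13],
]


def select(samples, genes):
    """Return {(sample, gene): value} for the requested coordinates only."""
    result = {}
    for sample in samples:
        row = SAMPLES.index(sample)
        for gene in genes:
            col = GENES.index(gene)
            result[(sample, gene)] = VALUES[row][col]
    return result


# Pull a 2 x 2 sub-array without touching the rest of the data.
subset = select(["TCGA2", "TCGA4"], ["EGFR", "TP53"])
for coordinate, value in sorted(subset.items()):
    print(coordinate, value)
```

Addressing cells by dimension coordinates rather than by file offsets is what lets a query pick out a small region without copying or re-parsing the whole dataset.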
Reproducible research
• Curate once
• Explore many times
• By many concurrent users
• Enforces data integrity
• Versions data
• Track and trace data and queries
[Figure: User 1, User 2, User 3, and User 4 each run Data Exploration against a single SciDB instance]
SciDB & BioCompute Objects
• Data loaders enforce type & field constraints
• Arrays are versioned and time-stamped
• Database log tracks all parameters, data changes, user actions
• Utility could represent queries as JSON objects
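As a rough illustration of the last bullet, a query and its parameters could be wrapped in a JSON object along these lines. This is only a sketch: the `query_record` helper and its field names (`query`, `user`, `parameters`, `logged_at`) are hypothetical and do not come from SciDB or from the BioCompute specification.

```python
import json
from datetime import datetime, timezone


def query_record(query_text, user, parameters):
    """Wrap one query and its parameters in a JSON-serialisable dict.

    The field names here are hypothetical, chosen only for illustration.
    """
    return {
        "query": query_text,
        "user": user,
        "parameters": parameters,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


record = query_record(
    "QueryF(x, y, z)",              # placeholder query text
    "example_user",                 # placeholder user name
    {"phred_threshold": "50"},
)
encoded = json.dumps(record, sort_keys=True)   # text form that can be stored or logged
decoded = json.loads(encoded)                  # round-trips back to the same dict
print(encoded)
```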
Group-based Somatic Point Mutation Calling
• Reveals statistically significant minority somatic mutations
• Technique gives the FDA an explanation and proof of the repeatability, traceability, and accuracy of the variant-calling pipeline
Simultaneous large group pile-ups provide more accurate identification and accommodation of sequencing errors*
* Standish, et al. BMC Bioinformatics (2015) 16:304
Somatic Point Variant Caller
[Figure: data flow of the caller]
• Inputs: BAM 1, BAM 2, BAM …
• Per BAM: count A, C, T, G, ? per position
• Aggregate total across BAMs: pile up over ALL BAM files in a single large array
• Join with reference arrays (some examples): Reference Genome, Known Variants, Quality Regions
• Compute per-position stats
• Filter based on: base quality, read mapping quality, coverage
• Output: filtered calls per position
• Test using spike-in
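The counting and aggregation steps of this flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SciDB implementation: each "BAM" is reduced to a hypothetical list of `(start_position, sequence)` pairs, and `count_bases` and `aggregate` are illustrative names.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def count_bases(reads):
    """Count A, C, G, T and '?' per position for one sample.

    `reads` is a list of (start_position, sequence) pairs, a toy stand-in
    for the aligned reads of one BAM file.
    """
    counts = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            key = base if base in BASES else "?"   # anything else is an unknown call
            counts[start + offset][key] += 1
    return counts


def aggregate(per_sample_counts):
    """Sum per-position base counts across all samples into one pileup."""
    total = defaultdict(Counter)
    for counts in per_sample_counts:
        for position, base_counts in counts.items():
            total[position].update(base_counts)
    return total


bam1 = [(100, "ACGT"), (101, "CGTA")]
bam2 = [(100, "ACGN"), (102, "TT")]
pileup = aggregate([count_bases(bam1), count_bases(bam2)])
for position in sorted(pileup):
    print(position, dict(pileup[position]))
```

Keeping the per-sample counts and the aggregated pileup as separate structures mirrors the "Per BAM" and "Aggregate total across BAMs" stages in the figure.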
Methodology
• Formulate the caller as a simple signal/noise filter
• Tunable and explainable ROC (receiver operating characteristic) curves
– Control false-positive and false-negative rates by experimenting with settings for noise and filter thresholds without having to reload data
– Can use ‘spiked in’ data to provide known answer to guide parameter setting for variant calling QA
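The threshold-tuning idea above can be sketched as a sweep over candidate thresholds, scored against spiked-in positions whose answer is known. This is a minimal sketch with made-up numbers: `roc_points`, the per-position `scores`, and the threshold values are all hypothetical, and the sketch assumes every spiked-in position appears in `scores` and that both classes are non-empty.

```python
def roc_points(scores, truth, thresholds):
    """Return (threshold, true_positive_rate, false_positive_rate) per threshold.

    scores: {position: signal strength}; truth: set of spiked-in positions.
    Assumes truth is a non-empty subset of scores and some positions are not in it.
    """
    positives = len(truth)
    negatives = len(scores) - positives
    points = []
    for threshold in thresholds:
        called = {pos for pos, signal in scores.items() if signal >= threshold}
        true_pos = len(called & truth)
        false_pos = len(called - truth)
        points.append((threshold, true_pos / positives, false_pos / negatives))
    return points


# Made-up signal values; positions 2 and 4 play the role of spiked-in variants.
scores = {1: 0.02, 2: 0.30, 3: 0.01, 4: 0.12, 5: 0.04, 6: 0.08}
spiked = {2, 4}
points = roc_points(scores, spiked, [0.03, 0.05, 0.10, 0.20])
for threshold, tpr, fpr in points:
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Because only the threshold changes between iterations, the sweep re-reads the same stored scores each time, which is the "without having to reload data" property the slide emphasises.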
Base-level PHRED Score Distribution
Phred score   Probability of incorrect call   Call accuracy
10            1 in 10                         90%
20            1 in 100                        99%
30            1 in 1000                       99.9%
40            1 in 10,000                     99.99%
50            1 in 100,000                    99.999%
60            1 in 1,000,000                  99.9999%
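The table follows from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Regenerate the rows of the table from the formula.
for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} incorrect, accuracy {(1 - p) * 100:.4f}%")
```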
MAPQ Scores and Coverage
• Histograms show % of bases excluded at specific thresholds
• PHRED and MAPQ thresholds affect coverage distribution
Effect of PHRED & MAPQ thresholds on noise
Thresholding PHRED and MAPQ scores has the effect of "shifting" the noise into the lower allele-frequency band
[Figure: Noise Histogram. x-axis: Second Call Ratio (count of the 2nd most common base / total coverage at that position); y-axis: Density]
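The second call ratio on the x-axis can be computed directly from per-position base counts. The sketch below assumes each position is a small dict of base counts (toy values, not real data); the helper names and the choice of 20 histogram bins are illustrative.

```python
from collections import Counter


def second_and_total(counts):
    """Return (count of the 2nd most common base, total coverage) at one position."""
    ordered = sorted(counts.values(), reverse=True)
    second = ordered[1] if len(ordered) > 1 else 0
    return second, sum(ordered)


def second_call_ratio(counts):
    """2nd most common base count divided by total coverage (0.0 if no coverage)."""
    second, total = second_and_total(counts)
    return second / total if total else 0.0


def ratio_histogram(pileup, n_bins=20):
    """Bin positions by second call ratio, using integer arithmetic for the bin index."""
    hist = Counter()
    for counts in pileup:
        second, total = second_and_total(counts)
        if total == 0:
            continue                      # skip uncovered positions
        hist[min(second * n_bins // total, n_bins - 1)] += 1
    return hist


# Toy per-position base counts.
positions = [
    {"A": 98, "C": 2},    # low-level noise
    {"G": 100},           # clean position
    {"T": 60, "C": 40},   # strong second allele
]
ratios = [second_call_ratio(counts) for counts in positions]
hist = ratio_histogram(positions)
print(ratios, dict(hist))
```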
Neither the low-complexity filter nor the low-confidence-region filter reduces noise further
[Figure: two histograms of Density vs. Second Call Ratio, one for the low-complexity filter and one for the low-confidence-region filter]
BioCompute PiD Concept
"id": "obj.1243",
"type": "biocompute",
"name": "SciDB GIAB minority clone variant call SciDB PiD",
#author,created,…
"parametric_domain": {
  "phred_threshold": "50",
  "coverage_threshold": "450",
  …
},
"scidb_domain": {
  "hostname": "clust_scidb_01",
  "db_user": "apoliakov",
  …
  "arrays": [{
    "id": "obj.1243",
    "name": "GIAB_CLONALITY.BAM_DATA",
    "schema_type": "Multi-sample BAM Pileup Array",
    "size": "78.5TB",
    "created": "Jan 10 2017 11:57:34",
    …
JSON can be stored directly in the database
BioCompute PiD Concept
"scidb_domain": {
  …
  "arrays": [{
    "id": "obj.1243",
    "name": "GIAB_CLONALITY.BAM_DATA",
    "schema_type": "Multi-sample BAM Array",
    "size": "78.5TB",
    "created": "Jan 10 2017 11:57:34",
    "modified": …
  }, {
    "id": "obj.1244",
    "name": "GIAB_CLONALITY.BAM_PILEUP",
    "schema_type": "BAM per-sample filtered BAM pileup",
    "size": "350GB",
    "created": "Jan 10 2017 18:67:34",
    …
  }, ...]
}
BioCompute PiD Concept
"execution_domain": [{
  "id": "obj.1237",
  "env_parameters": "example",
  "location": "/workflows/giab_clonality.R",
  "platform": "SciDB-R",
  ...
}]
• Execution points to SciDB query scripts as before
• Optional Array Version History
"arrays": [{
  "name": "GIAB_CLONALITY.BAM_DATA",
  "versions": [{
    "version_id": "3",
    "created": "Jan 10 2017 11:59:34"
  }, ...
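The fragments on these three slides can be assembled into one object and round-tripped through JSON. The sketch below uses only fields that appear on the slides and simply omits everything the slides elide; it illustrates the concept and is not a complete or validated BioCompute object.

```python
import json

# Only fields shown on the slides are included; elided fields are left out.
pid_object = {
    "id": "obj.1243",
    "type": "biocompute",
    "name": "SciDB GIAB minority clone variant call SciDB PiD",
    "parametric_domain": {
        "phred_threshold": "50",
        "coverage_threshold": "450",
    },
    "scidb_domain": {
        "hostname": "clust_scidb_01",
        "arrays": [
            {"id": "obj.1243", "name": "GIAB_CLONALITY.BAM_DATA", "size": "78.5TB"},
            {"id": "obj.1244", "name": "GIAB_CLONALITY.BAM_PILEUP", "size": "350GB"},
        ],
    },
    "execution_domain": [
        {
            "id": "obj.1237",
            "location": "/workflows/giab_clonality.R",
            "platform": "SciDB-R",
        }
    ],
}

text = json.dumps(pid_object, indent=2)   # serialised form, e.g. for storage alongside the data
restored = json.loads(text)               # parsing it back yields the same structure
array_names = [array["name"] for array in restored["scidb_domain"]["arrays"]]
print(array_names)
```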
BioCompute & SciDB: a pipeline-in-a-database
Zachary Pitluk, Ph.D., V.P. Life Sciences & [email protected]