SePhHaDe: Computational Challenges on High-Throughput Sequencing and Phenotyping
E. Pacitti & E. Rivals
Mastodons Colloquium, 22 January 2015, Paris
OUTLINE
• Introduction
• Phenotyping
  – Information Retrieval of Complex Contents
  – Search and Recommendation for Image Observations
• Sequencing
  – Indexing reads and sequencing error correction
  – New spaced seed filtering for similarity search
  – Metagenomics pipeline
• Project fusion with Credible
• Conclusions
INTRODUCTION - ARCHITECTURE
SCIENTIFIC BIG DATA: sequencing, phenotyping, and image data
P2P, Cloud, Data Streams, Multi-Site, HPC
Data Analysis
Programs for High-Throughput Sequencing
Indexing
Information Retrieval of Complex Contents
Recommendation
PLANT PHENOTYPING
PHENOTYPING - MOTIVATIONS
• Phenotype
  – Observable state, characteristic, or behavior of a living being
  – Examples: plant morphology, blood glucose levels
  – Data: images, metadata
• Phenotyping
  – Observational method that records phenotype data for analysis, querying, etc.
  – Botanical observation uses content-based multimedia identification methods
Greenhouse-based phenotyping (Inra, Montpellier)
Field-based phenotyping (plant observations)
INFORMATION RETRIEVAL OF COMPLEX CONTENTS
CONTEXT
• Accurate knowledge of the distribution and evolution of living species is essential
  – Ultimate goal: sustainable and global surveillance tools for living species
  – Global warming effects, invasive species, biodiversity, impact of human activities
• It is necessary to boost the production of observations
• The taxonomic gap is a tricky problem
• Scientific name = unique access key to information
• Knowledge accessible only to specialists
Taxon Castanea sativa Mill.
LIFECLEF OBJECTIVES
A new lab of the CLEF evaluation forum
• “European NIST” for information retrieval: 15 years, hundreds of research groups worldwide
Objectives
• Study, evaluate and boost state-of-the-art content-based multimedia identification methods (signals+metadata)
• Assemble a transdisciplinary and cross-media community around the topic
• Promote environmental challenges in the multimedia community
LIFECLEF 2014: THREE TASKS (CHALLENGES)
Based on real-world big data
LIFECLEF 2014: SCHEDULE
• December 2013: registration opens
• January 2014: training data and task details release
• March 2014: test data release
• May 1st 2014: deadline for submission of runs
• May 15th 2014: release of results
• June 15th 2014: deadline for submission of working notes (peer reviewed)
• 15-19 September 2014: LifeCLEF Workshop at the CLEF 2014 Conference (Sheffield, UK)
REGISTRANTS / COMMUNITY
[Figure: breakdown of registered groups: 42, 31, 5, 4, 36, 6, and 3 groups]
127 research groups registered worldwide (academics & industry)
PARTICIPANTS ON FINISH LINE
[Figure: breakdown of participating groups: 0, 0, 0, 10, 10, 1, and 1 groups]
22 groups submitted a total of 70 runs and 22 working notes (published in CEUR-WS proceedings)
FOCUS ON PLANTCLEF
Hervé Goeau (Inria), Pierre Bonnet (CIRAD)
• Context: Pl@ntNet, a multimedia-oriented citizen science and participatory sensing initiative
• iPhone & Android applications (content-based participatory sensing): +350K downloads, 2K users/day
• Botanical social network (citizen science): +20K members
• France flora dataset: +100K images, +5K species
FOCUS ON PLANTCLEF: DATA
               2011    2012    2013    2014
Species          71     126     250     500
Images        5,400  11,500  26,077  60,962
Organs/views
Contributors     17      46     327   1,000
Observations    368   1,136  15,046  30,136
FOCUS ON PLANTCLEF: RESULTS
• Best scores increase each year despite the increasing complexity of the task [Joly et al., CLEF proceedings 2014]
• Man vs. machine experiment [Bonnet et al., MTAP journal 2015]
Best scores per year:
               2011  2012  2013  2014
Scan-like      0.52  0.56  0.61  0.64
Photographs    0.25  0.32  0.40  0.47
NEXT YEAR
• Keep the 3 tasks but with enriched test data (1000 bird species, 1000 plant species)
• PlantCLEF novelty: external training data is authorized as a variant of the task
  – On the condition that the experiment is reliable and reproducible
  – Data availability, clear description, no risk of including test data, etc.
• FishCLEF restructuring: fusion of the 4 subtasks into a single application-oriented task
  – Counting the number of fish instances for a list of species
  – This should avoid fragmentation and make the task more attractive for ground-breaking research
SEARCH AND RECOMMENDATION OF PLANT OBSERVATIONS
CHALLENGE
[Figure: distribution of #observations against #species]
A few plants account for the majority of the observations, while the majority of plants are rarely observed.
A better distribution for recommendations is needed: diversification.
[Figure: #recommendations]
Challenge: retrieve/recommend the k most diverse plant observations given a query (e.g., grape).
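As an illustration of what "the k most diverse observations" can mean, here is a minimal sketch of greedy diversification in the style of Maximal Marginal Relevance. This is a generic, well-known technique swapped in for illustration, not the probabilistic scoring function or the threshold algorithm of the project. The observation records, the `relevance` field, the `same_species` similarity, and the `trade_off` parameter are all hypothetical.

```python
def diversified_top_k(candidates, k, similarity, trade_off=0.5):
    """Greedily pick k items, trading relevance against redundancy.

    candidates: list of dicts with a "relevance" score (hypothetical schema).
    similarity: function (item, item) -> value in [0, 1].
    trade_off:  1.0 means pure relevance, 0.0 means pure diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def marginal(item):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return trade_off * item["relevance"] - (1 - trade_off) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


def same_species(a, b):
    # Hypothetical similarity: two observations are redundant if they show the same species.
    return 1.0 if a["species"] == b["species"] else 0.0


# Made-up observations returned for a query such as "grape".
observations = [
    {"id": 1, "species": "Vitis vinifera", "relevance": 0.90},
    {"id": 2, "species": "Vitis vinifera", "relevance": 0.85},
    {"id": 3, "species": "Vitis riparia", "relevance": 0.70},
    {"id": 4, "species": "Parthenocissus quinquefolia", "relevance": 0.60},
]

picked = diversified_top_k(observations, k=3, similarity=same_species)
print([o["id"] for o in picked])
```

A plain top-3 by relevance would return two observations of the same species; the greedy selection skips the second one in favor of rarer species.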
PROFILE DIVERSITY
With profile diversity*, the recommended observations take into account the diversity of relevant user profiles and their observations.
• Results:
  – A new scoring function based on a probabilistic model (2013)
  – Top-k threshold algorithm for content and profile diversity (2014)
  – Optimizations for scaling up, by a factor of 12 (2014) [Servajean et al., Information Systems Journal, 2015]
  – Distributed profile diversification [Servajean et al., Globe 2014]
  – 2 prototypes [Servajean et al., BDA 2014]
  – 1 PhD thesis defense
• Next year: exploit recommendation/crowdsourcing for plant identification
USE CASE: PROFILE DIVERSITY FOR PLANT OBSERVATION
Sequencing data analysis: a challenge
Overview
[Diagram: BIG DATA ANALYSIS, with components: Recommendation; Information retrieval of complex content; Next-generation sequencing bioinformatics programs; Indexing]
Context
• 3rd generation sequencing technologies yield longer reads
• PacBio SMRT sequencing: much longer reads (up to 20 Kb) but much higher error rates
• Error correction is required
  1. self correction: using only PacBio reads [Chin et al. 2013]
  2. hybrid correction: using short reads to correct long reads (our focus!)
Motivation
Long read (LR) correction programs "require high computational resources and long running times on a supercomputer even for bacterial genome datasets".
[Deshpande et al. 2013]
Algorithm overview
1. build a de Bruijn graph of the short reads
2. take each long read in turn and attempt to correct it
I. correct internal regions,
II. correct end regions of the long read
Example of de Bruijn Graph of order k = 3
Nodes (3-mers): bba, bac, acb, cba, bab, aac, baa, abc, caa, bca
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}
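The example above can be reproduced with a few lines of code. This is a minimal sketch that assumes a node-centric definition of the graph (nodes are the k-mers of the reads, and an arc goes from u to v whenever the last k-1 letters of u equal the first k-1 letters of v). The function name `build_de_bruijn` is hypothetical, and the sketch says nothing about how the actual software stores the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {k-mer: sorted list of successor k-mers} for a set of reads."""
    # Collect every k-mer occurring in at least one read.
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    # Group nodes by their (k-1)-letter prefix to find successors quickly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # u -> v when the (k-1)-suffix of u is the (k-1)-prefix of v.
    return {node: sorted(by_prefix[node[1:]]) for node in nodes}


S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]
graph = build_de_bruijn(S, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

With k = 3 this yields exactly the ten 3-mers listed for the example.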
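A long read is corrected with the de Bruijn graph (DBG)
[Figure: three cases along a long read: a bridge path between s1 and t1, no path found between s2 and t2, and an extension path starting from s3]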
For each putative region of a long read:
• align the region to paths of the de Bruijn graph
• find the best path according to edit distance
• limited path search
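The three steps above can be sketched as a bounded depth-first search that spells each path between two k-mers and keeps the one closest to the erroneous region. This is an illustrative sketch under stated assumptions, not the actual correction algorithm: the names `best_bridge_path`, `max_len`, and `max_paths` are hypothetical, and `TOY_GRAPH` is the order-3 example graph written out by hand under a node-centric definition.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def best_bridge_path(graph, source, target, region, max_len, max_paths=1000):
    """Search paths from source to target; return (distance, sequence) or None.

    The search is limited in two ways: spelled paths never exceed max_len
    letters, and at most max_paths complete or abandoned paths are examined.
    """
    best = None
    explored = 0
    stack = [(source, source)]  # (current k-mer, sequence spelled so far)
    while stack and explored < max_paths:
        node, seq = stack.pop()
        if node == target:
            explored += 1
            d = edit_distance(seq, region)
            if best is None or d < best[0]:
                best = (d, seq)
            continue
        if len(seq) >= max_len:
            explored += 1  # path abandoned: too long
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, seq + nxt[-1]))
    return best


# Order-3 example graph (successor lists), assumed node-centric.
TOY_GRAPH = {
    "bba": ["baa", "bab", "bac"], "cba": ["baa", "bab", "bac"],
    "bac": ["acb"], "acb": ["cba"], "bab": ["abc"], "abc": ["bca"],
    "bca": ["caa"], "caa": ["aac"], "aac": ["acb"], "baa": ["aac"],
}

# A noisy region between the k-mers "bba" and "acb" (the "g" is an error).
print(best_bridge_path(TOY_GRAPH, "bba", "acb", "bbaagcb", max_len=8))
```

When no path reaches the target within the limits, the function returns `None`, which corresponds to the "path not found" case.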
Runtime, memory and disk usage
[Figure: CPU time (h), memory (GB), and disk (GB) on the Yeast dataset for PacBioToCA, LSC, and LoRDEC]
Scalability of LoRDEC
[Figure: CPU time (h), memory (GB), and disk (GB) of LoRDEC on the E. coli, Yeast, and Parrot datasets, logarithmic axis]
New spaced seed filtering for similarity search
New seeds for sequence comparison
• Principle: similar sequences share exact or approximate common subwords
• Application
  – choice of the combinatorial model for the seed (sensitivity, selectivity)
  – data organization (hashtable, burst trie, suffix array, BWT, ...)
  – choice of the algorithm to locate seeds
[Figure: examples of a contiguous seed, a spaced seed, and an approximate seed (up to 1 error)]
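To make the seed principle concrete, here is a minimal sketch of spaced seed filtering with a hash table, one of the data organizations listed above. It assumes seeds are written as strings of `1` (position must match) and `*` (position is ignored); the function names are hypothetical and the two short sequences are illustrative.

```python
from collections import defaultdict


def seed_key(word, seed):
    """Keep only the letters of word that sit under a '1' of the seed."""
    return "".join(c for c, s in zip(word, seed) if s == "1")


def index_text(text, seed):
    """Hash table: seed key -> list of start positions in text."""
    index = defaultdict(list)
    span = len(seed)
    for i in range(len(text) - span + 1):
        index[seed_key(text[i:i + span], seed)].append(i)
    return index


def seed_hits(query, index, seed):
    """Return (query position, text position) pairs sharing a seed key."""
    span = len(seed)
    hits = []
    for j in range(len(query) - span + 1):
        for i in index.get(seed_key(query[j:j + span], seed), []):
            hits.append((j, i))
    return hits


text = "ATCAGTGCGAATGCGCAAGA"
query = "ATCAGCGCAAATGCTCAAGA"  # differs from text at 3 positions

for seed in ("111111", "111*1*11"):  # contiguous vs spaced, both with six 1s
    hits = seed_hits(query, index_text(text, seed), seed)
    on_diagonal = [h for h in hits if h[0] == h[1]]
    print(seed, on_diagonal)
```

On this pair, the contiguous seed finds no hit on the main diagonal because no run of six matches exists, whereas the spaced seed of the same weight finds three.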
New seeds for sequence comparison: Results
• study of the coverage criterion
Coverage measure for a seed
Definition: number of match symbols covered by at least one 1 symbol from any seed hit [Benson and Mak, 2008; Martin, 2013]
Example:
ATCAGTGCGAATGCGCAAGA
|||||:||:|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11
The coverage is 15.
Laurent Noe, Donald E. K. Martin, A coverage criterion for spaced seeds and its applications
• optimisation of spaced seeds for eukaryotic genome comparison
• new type of seeds for short patterns with high error rate
ATGG TACA TCAA CGTA GCAT
ATGG TATA TCGAA CGGA GCAT
0 1 1 1 0
ATG TACA TCTA CGTA GACAT
0 1 0
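The coverage measure defined on this slide (match symbols covered by a 1 of at least one seed hit) can be checked on the slide's own alignment. This is a minimal sketch for a gapless alignment given as a match string in which `|` marks a match; the function names are hypothetical.

```python
def seed_hit_positions(matches, seed):
    """Start positions where every '1' of the seed falls on a match column."""
    ones = [o for o, s in enumerate(seed) if s == "1"]
    return [p for p in range(len(matches) - len(seed) + 1)
            if all(matches[p + o] == "|" for o in ones)]


def coverage(matches, seed):
    """Number of match columns covered by a '1' of at least one seed hit."""
    ones = [o for o, s in enumerate(seed) if s == "1"]
    covered = set()
    for p in seed_hit_positions(matches, seed):
        covered.update(p + o for o in ones)
    return len(covered)


# Match string of the alignment ATCAGTGCGAATGCGCAAGA / ATCAGCGCAAATGCTCAAGA.
matches = "|||||:||:|||||.|||||"
seed = "111*1*11"

print(seed_hit_positions(matches, seed))  # three hits
print(coverage(matches, seed))            # 15, as stated on the slide
```

Counting covered columns rather than hits rewards seeds whose hits overlap little, which is the intuition behind using coverage as a design criterion.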
New seeds for sequence comparison: some examples of biological applications
• read mapping (ongoing) [BGE 2014]
• finding microRNA targets at genome scale [IWOCA 2014]
• taxonomic assignment in metagenomics (ongoing)
• 20 000 new alignments between human and mouse genomes [NAR 2014]
• non-coding RNA classification by Support Vector Machine string kernels [JCB 2014]
Metagenomic sample analysis
Comparison of metagenomic data
• Input:
– Sequencing data from the environment (water, soil, air, etc.)
– Protein sequence banks
• Output:
– The set of proteins that match the metagenomic data
[Diagram: metagenomic data (RNA-seq) + protein bank → comparison → list of matching proteins → functions]
Comparing metagenomic samples to protein banks is a way to functionally characterize a specific environment
PROTEIN => FUNCTION
Common method
[Diagram: metagenomic data + protein bank → BLASTx → SELECT → protein list]
• BLASTx: standard software used by everyone
• Time-consuming process: several hours (days) of computation on multicore systems
MASTODONS CHALLENGE
Speed up the process by at least one order of magnitude
MASTODONS Approach
[Diagram: metagenomic data → MetaContiger → PLASTx (with protein bank) → SELECT → protein list; the standard approach uses BLASTx]
• PLASTx: software developed by GenScale (before MASTODONS); speed-up ≈ ×5
• MetaContiger: new software developed in this project; eliminates redundancy in the metagenomic data, significantly decreasing the number of metagenomic sequences to compare
Results
• Project still going on
• Preliminary results:
– Global speed-up: from ×10 to ×30 (1 day vs. 1 hour)
– Highly correlated with the redundancy of the metagenomic data
• Future
– Validation from a qualitative point of view
– Test on various metagenomic projects
– Extend the method to the general sequence comparison problem
Conclusion
Actions and highlights
• Colloquium "Indexing scientific big data": 147 participants, Paris, 15 Jan 2014
• Joint workshop with COST Action SeqAhead, "Data Structures in Bioinformatics": 10 countries
• LifeCLEF challenge launched, with meetings throughout 2014
• New partner teams: Telabotanica, Univ. Rouen, UPMC, Paris 5, CIRAD
SCIENTIFIC BIG DATA: sequencing, phenotyping, and image data
• Voluminous • Complex • Heterogeneous
Data Analysis and Workflows
Programs for High-Throughput Sequencing
Indexing, Mediation
Recommendation and Retrieval of Complex Contents
P2P, Cloud, Multi-Site, HPC
Future
• Fusion of the SePhHaDe and Credible projects
• New graphs, indexes, and algorithms for genome assembly
• Metagenomic pipeline
• Recommendation and plant identification for LifeCLEF
• New edition of the workshop "Data Structures in Bioinformatics"
Publications
• Drezen et al., GATB: Genome Assembly & Analysis Tool Box, Bioinformatics, 2014
• Salmela and Rivals, LoRDEC: accurate and efficient long read error correction, Bioinformatics, 2014
• Noe and Martin, A coverage criterion for spaced seeds and its applications to SVM string kernels and k-mer distances, J. Computational Biology, 2014
• Frith and Noe, Improved search heuristics find 20 000 new alignments between human and mouse genomes, Nucleic Acids Research, 2014
• Cazaux et al., From Indexing Data Structures to de Bruijn Graphs, CPM, 2014
• Blanc-Mathieu et al., An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina read de novo assemblies, BMC Genomics, 2014
• Servajean et al., Profile Diversity for Query Processing using User Recommendations, Information Systems, 2015
• Joly et al., Are Species Identification Tools Biodiversity-friendly?, ACM IW on Multimedia Analysis for Ecological Data, 2014
• Joly et al., LifeCLEF 2014: multimedia life species identification challenges, Information Access Evaluation, 2014
• Cazaux and Rivals, Reverse Engineering of Compact Suffix Trees and Links, J. Discrete Algorithms, 2014
Partners
Thank you for your attention