
SePhHaDe: Computational Challenges on High-Throughput Sequencing and Phenotyping

E. Pacitti & E. Rivals

Colloque Mastodons 22/1/2015, Paris

PLAN

•  Introduction
•  Phenotyping
   –  Information Retrieval of Complex Contents
   –  Search and Recommendation for Image Observations
•  Sequencing
   –  Indexing reads and sequencing error correction
   –  New spaced seed filtering for similarity search
   –  Metagenomics pipeline
•  Project fusion with Credible
•  Conclusions

INTRODUCTION - ARCHITECTURE

SCIENTIFIC BIG DATA: sequencing, phenotyping and image data

P2P, Cloud, Data Streams, Multi-Site, HPC

Data Analysis

Programs for High-Throughput Sequencing

Indexing

Information Retrieval of Complex Contents

Recommendation

PLANT PHENOTYPING

PHENOTYPING - MOTIVATIONS

•  Phenotype
   –  Observable state, characteristic or behavior of a living being
   –  Examples: plant morphology, blood glucose levels
   –  Data: images, metadata
•  Phenotyping
   –  Observational method that records phenotype data for analysis, querying, etc.
   –  Botanical observation uses content-based multimedia identification methods

Greenhouse based Phenotyping (Inra, Montpellier)

Field-based phenotyping (plant observations)

INFORMATION RETRIEVAL OF COMPLEX CONTENTS

CONTEXT

•  Accurate knowledge of the distribution and evolution of living species is essential
   –  Ultimate goal: sustainable, global surveillance tools for living species
   –  Global warming effects, invasive species, biodiversity, impact of human activities
•  It is necessary to boost the production of observations
•  The taxonomic gap is a tricky problem
   –  Scientific name = unique access key to information
   –  Knowledge accessible only to specialists

Taxon Castanea sativa Mill.

LIFECLEF OBJECTIVES

A new lab of the CLEF evaluation forum

•  CLEF is a “European NIST” for information retrieval: 15 years, hundreds of research groups worldwide

Objectives

•  Study, evaluate and boost state-of-the-art content-based multimedia identification methods (signals+metadata)

•  Assemble a transdisciplinary and cross-media community around the topic

•  Promote environmental challenges in the multimedia community

LIFECLEF 2014: THREE TASKS (CHALLENGES)

Based on real-world big data

LIFECLEF 2014: SCHEDULE

•  December 2013: registration opens
•  January 2014: training data and task details released
•  March 2014: test data released
•  May 1st 2014: deadline for submission of runs
•  May 15th 2014: release of results
•  June 15th 2014: deadline for submission of working notes (peer reviewed)
•  15-19 September 2014: LifeCLEF Workshop at the CLEF 2014 Conference (Sheffield, UK)

REGISTRANTS / COMMUNITY

[Figure: map of registered groups by region: 42, 31, 5, 4, 36, 6 and 3 groups]

127 research groups registered worldwide (academics & industry)

PARTICIPANTS ON FINISH LINE

[Figure: map of groups that completed the challenge by region: 0, 0, 0, 10, 10, 1 and 1 groups]

22 groups submitted a total of 70 runs and 22 working notes (published in CEUR-WS proceedings)

FOCUS ON PLANTCLEF

Hervé Goeau (Inria), Pierre Bonnet (CIRAD)

•  Context: the Pl@ntNet initiative, multimedia-oriented citizen science and participatory sensing
   –  France flora dataset
   –  iPhone & Android applications: content-based participatory sensing
   –  Botanical social network: citizen science
   –  +350K downloads, 2K users / day
   –  +20K members, +100K images, +5K species

FOCUS ON PLANTCLEF: DATA

                2011     2012     2013     2014
Species           71      126      250      500
Images         5,400   11,500   26,077   60,962
Organs/views
Contributors      17       46      327    1,000
Observations     368    1,136   15,046   30,136

FOCUS ON PLANTCLEF: RESULTS

•  Best scores increase each year despite the increasing complexity of the task [Joly et al., CLEF proceedings 2014]
•  Man vs. machine experiment [Bonnet et al., MTAP journal 2015]

              2011   2012   2013   2014
Scan-like     0.52   0.56   0.61   0.64
Photographs   0.25   0.32   0.40   0.47

NEXT YEAR

•  Keep the 3 tasks but with enriched test data (1000 bird species, 1000 plant species)
•  PlantCLEF novelty = external training data authorized as a variant of the task
   –  On the condition that the experiment is reliable and reproducible
   –  Data availability, clear description, no risk of including test data, etc.
•  FishCLEF restructuring = fusion of the 4 subtasks into 1 single application-oriented task
   –  Counting the number of fish instances for a list of species
   –  This should avoid fragmentation and make the task more attractive for breaking research

SEARCH AND RECOMMENDATION OF PLANT OBSERVATIONS

CHALLENGE

[Figure: number of observations per species (axes: #observations, #species)]

A few plants represent the majority of the observations! The majority of plants are rarely observed!

A better distribution for recommendations: need for diversification!

[Figure: distribution of recommendations (axis: #recommendations)]

Challenge: retrieve/recommend the k most diverse plant observations for a given query (e.g., grape).
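To make the challenge concrete, here is a minimal sketch of greedy diversification using maximal marginal relevance (MMR). This is a generic, swapped-in technique, not the probabilistic scoring function of the project; the observation identifiers, relevance scores, feature sets and the `trade_off` parameter are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def diversify(candidates, k, trade_off=0.5):
    """Greedily pick k observations, balancing relevance and novelty (MMR).

    candidates: list of (obs_id, relevance, features) tuples (hypothetical format).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            # Redundancy = similarity to the closest already selected observation.
            redundancy = max((jaccard(c[2], s[2]) for s in selected), default=0.0)
            return trade_off * c[1] - (1 - trade_off) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [c[0] for c in selected]


# Hypothetical observations returned for the query "grape".
observations = [
    ("o1", 0.90, {"Vitis vinifera", "leaf"}),
    ("o2", 0.85, {"Vitis vinifera", "leaf"}),
    ("o3", 0.60, {"Vitis riparia", "fruit"}),
    ("o4", 0.50, {"Parthenocissus", "leaf"}),
]
print(diversify(observations, k=2))  # ['o1', 'o3']
```

A plain top-2 by relevance would return o1 and o2, two near-identical observations; the redundancy penalty lets the less frequent species surface instead.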

PROFILE DIVERSITY

With profile diversity, the recommended observations take into account the diverse profiles of relevant users and their observations.

•  Results:
   –  A new scoring function based on a probabilistic model (2013)
   –  Top-k threshold algorithm for content and profile diversity (2014)
   –  Optimizations for scaling up, by a factor of 12 (2014) [Servajean et al., Information Systems Journal, 2015]
   –  Distributed profile diversification [Servajean et al., Globe 2014]
   –  2 prototypes [Servajean et al., BDA 2014]
   –  1 PhD thesis defense
•  Next year: exploit recommendation / crowdsourcing for plant identification

USE CASE: PROFILE DIVERSITY FOR PLANT OBSERVATION

Sequencing data analysis: a challenge

Overview

BIG DATA ANALYSIS

Recommendation

Information retrieval of complex content

Next-Gen Sequencing bioinformatics programs

Indexing

Context

•  3rd generation sequencing technologies yield longer reads
•  PacBio SMRT sequencing: much longer reads (up to 20 Kb) but much higher error rates
•  Error correction is required
   1. self correction: using only PacBio reads [Chin et al. 2013]
   2. hybrid correction: using short reads to correct long reads (our focus!)

Motivation

Long read (LR) correction programs "require high computational resources and long running times on a supercomputer even for bacterial genome datasets" [Deshpande et al. 2013].

Algorithm overview

1. build a de Bruijn graph of the short reads

2. take each long read in turn and attempt to correct it

I. correct internal regions,

II. correct end regions of the long read

Example of de Bruijn Graph of order k = 3

Nodes (3-mers): bba, bac, acb, cba, bab, aac, baa, abc, caa, bca

S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}
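The example above can be reproduced with a minimal sketch. It assumes a node-centric de Bruijn graph (nodes are the distinct k-mers of the reads, with an arc from u to v whenever the last k-1 letters of u equal the first k-1 letters of v); the function names are illustrative and this is not the GATB implementation.

```python
def kmers(read, k):
    """All substrings of length k of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Return (nodes, succ): the k-mer set and the successor lists."""
    nodes = {km for read in reads for km in kmers(read, k)}
    # Arc u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v.
    succ = {u: sorted(v for v in nodes if u[1:] == v[:-1]) for u in nodes}
    return nodes, succ


S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]
nodes, succ = de_bruijn_graph(S, 3)
print(sorted(nodes))
# ['aac', 'abc', 'acb', 'baa', 'bab', 'bac', 'bba', 'bca', 'caa', 'cba']
print(succ["cba"])
# ['baa', 'bab', 'bac']
```

The ten nodes are exactly the 3-mers listed on the slide. Comparing all node pairs is quadratic and only suitable for toy inputs.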

LoRDEC uses GATB (from IRISA partner Rennes)

http://gatb.inria.fr


Long read is corrected with DBG

[Figure: three cases along a long read: a bridge path found between s1 and t1; no path found between s2 and t2; an extension path starting from s3]

For each putative region of a long read:

•  align the region to paths of the de Bruijn graph
•  find the best path according to edit distance
•  the path search is limited
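The three steps can be illustrated with a small sketch under stated assumptions: a bounded depth-first enumeration of paths between two k-mers, each spelled path being scored against the region by edit distance. The limits `max_extra` and `max_expansions`, the function names and the erroneous region are hypothetical, and the actual LoRDEC search strategy may differ.

```python
def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def successors(reads, k):
    """Node-centric de Bruijn graph as successor lists (assumed definition)."""
    nodes = {km for r in reads for km in kmers(r, k)}
    return {u: sorted(v for v in nodes if u[1:] == v[:-1]) for u in nodes}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_bridge_path(succ, source, target, region, max_extra=2, max_expansions=10_000):
    """Best (distance, sequence) among bounded paths from source to target."""
    max_len = len(region) + max_extra      # limit on the spelled path length
    best = None
    stack = [(source, source)]             # (current k-mer, sequence spelled so far)
    expansions = 0
    while stack and expansions < max_expansions:
        node, spelled = stack.pop()
        expansions += 1
        if node == target:
            candidate = (edit_distance(spelled, region), spelled)
            if best is None or candidate < best:
                best = candidate
        if len(spelled) < max_len:
            for nxt in succ[node]:
                stack.append((nxt, spelled + nxt[-1]))
    return best


short_reads = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]
succ = successors(short_reads, 3)
# Hypothetical long-read region with one substitution error (a -> b).
print(best_bridge_path(succ, "bba", "caa", "bbacbbbcaa"))
# (1, 'bbacbabcaa')
```

When no path reaches the target within the limits, the function returns None, which corresponds to the "path not found" case.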

Runtime, memory and disk usage

[Figure: CPU time (h), memory (GB) and disk (GB) of PacBioToCA, LSC and LoRDEC on the Yeast dataset; axis from 0 to 1200]

Scalability of LoRDEC

[Figure: CPU time (h), memory (GB) and disk (GB) of LoRDEC on the E. coli, Yeast and Parrot datasets; logarithmic axis from 0.1 to 1000]

New spaced seed filtering for similarity search

New seeds for sequence comparison

•  Principle: similar sequences share exact or approximate common subwords
•  Application
   –  choice of the combinatorial model for the seed (sensitivity, selectivity)
   –  data organization (hashtable, burst trie, suffix array, BWT, ...)
   –  choice of the algorithm to locate seeds

Contiguous seed: TACGC

Spaced seed: ∗A∗GC

Approximate seed (up to 1 error): [Figure: automaton with transitions labelled by letters, ∗ and ε]
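As an illustration of the hashtable organization mentioned above, here is a minimal sketch that indexes a text by spaced-seed keys and looks up the windows of a query. The seed pattern `11*1` ("1" = position that must match, "*" = don't care), the sequences and the function names are hypothetical.

```python
from collections import defaultdict


def spaced_key(word, seed):
    """Keep only the letters at the '1' positions of the seed."""
    return "".join(c for c, s in zip(word, seed) if s == "1")


def build_index(text, seed):
    """Hashtable: spaced key -> list of start positions in the text."""
    index = defaultdict(list)
    span = len(seed)
    for i in range(len(text) - span + 1):
        index[spaced_key(text[i:i + span], seed)].append(i)
    return index


def find_hits(query, index, seed):
    """Pairs (text position, query position) sharing the same spaced key."""
    span = len(seed)
    hits = []
    for j in range(len(query) - span + 1):
        key = spaced_key(query[j:j + span], seed)
        hits.extend((i, j) for i in index.get(key, []))
    return hits


seed = "11*1"
index = build_index("ACGTACCT", seed)
print(find_hits("ACAT", index, seed))  # [(0, 0), (4, 0)]
```

The query window ACAT matches both ACGT and ACCT because the third position is ignored; a contiguous seed of the same span would find neither.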

New seeds for sequence comparison: Results

•  Study of the coverage criterion

Coverage measure for a seed

Definition: number of match symbols covered by at least one "1" symbol from any seed hit [Benson and Mak, 2008; Martin, 2013]

Example:

ATCAGTGCGAATGCGCAAGA
|||||:||:|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11

The coverage is 15.

Laurent Noe, Donald E. K. Martin, A coverage criterion for spaced seeds and its applications
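The coverage value of the example can be checked with a short sketch. It assumes the second sequence of the alignment is ATCAGCGCAAATGCTCAAGA (the slide text with its separator dots removed) and that a hit requires every "1" position of the seed to fall on a match; the function names are illustrative.

```python
def seed_hits(match, seed):
    """Start positions where every '1' of the seed falls on a match."""
    ones = [j for j, c in enumerate(seed) if c == "1"]
    return [i for i in range(len(match) - len(seed) + 1)
            if all(match[i + j] for j in ones)]


def coverage(match, seed):
    """Number of alignment positions covered by a '1' of at least one hit."""
    ones = [j for j, c in enumerate(seed) if c == "1"]
    covered = set()
    for i in seed_hits(match, seed):
        covered.update(i + j for j in ones)
    return len(covered)


x = "ATCAGTGCGAATGCGCAAGA"
y = "ATCAGCGCAAATGCTCAAGA"
match = [a == b for a, b in zip(x, y)]  # True where the two letters agree

print(seed_hits(match, "111*1*11"))  # [0, 9, 11]
print(coverage(match, "111*1*11"))   # 15
```

The three hits correspond to the three copies of the seed drawn under the alignment, and their "1" positions cover 15 of the 17 matching columns.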

•  Optimisation of spaced seeds for eukaryotic genome comparison

•  New type of seeds for short patterns with high error rate

ATGG TACA TCAA CGTA GCAT

ATGG TATA TCGAA CGGA GCAT

0 1 1 1 0

ATG TACA TCTA CGTA GACAT

0 1 0

New seeds for sequence comparison: some examples of biological applications

•  read mapping (ongoing) [BGE 2014]
•  finding microRNA targets at genome scale [IWOCA 2014]
•  taxonomic assignment in metagenomics (ongoing)
•  20 000 new alignments between human and mouse genomes [NAR 2014]
•  non-coding RNA classification by Support Vector Machine string kernels [JCB 2014]

Metagenomic sample analysis

Comparison of metagenomic data

• Input :

– Sequencing data from the environment (water, soil, air, etc.)

– Protein sequence banks

• Output :

– The set of proteins that match with metagenomic data

[Figure: metagenomic data (RNA-seq) are compared with a protein bank, giving a list of matching proteins]

FUNCTIONS

Comparing metagenomic samples to protein banks is a way to functionally characterize a specific environment

PROTEIN => FUNCTION

Common method

[Figure: metagenomic data and protein bank → BLASTx → SELECT → protein list]

BLASTx: standard software used by everyone

Time-consuming process: several hours (days) of computation on multicore systems

MASTODONS CHALLENGE

Speed up the process by at least one order of magnitude

MASTODONS Approach

[Figure: MASTODONS approach: metagenomic data → MetaContiger → PLASTx (with the protein bank) → SELECT → protein list; standard approach: BLASTx]

PLASTx: software developed by GenScale (before MASTODONS). Speed-up: about ×5.

MetaContiger: new software developed in this project. It eliminates redundancy in the metagenomic data, which significantly decreases the number of metagenomic sequences to compare.

Results

• Project still going on

• Preliminary results:

– Global speed-up: from ×10 to ×30 (1 day vs. 1 hour)

– Highly correlated to redundancy of metagenomic data

• Future

– Validation from a qualitative point of view

– Test on various metagenomic projects

– Extend the method to the general sequence comparison problem

Conclusion

Actions and highlights

•  Colloquium "Indexing scientific big data": 147 participants, Paris, 15 Jan 2014
•  Joint workshop with COST Action SeqAhead, "Data Structures in Bioinformatics": 10 countries
•  LifeCLEF challenge launched and meetings over 2014
•  New partner teams: Telabotanica, Univ. Rouen, UPMC, Paris 5, CIRAD

http://seqahead.cs.tu-dortmund.de/meetings:dsb

SCIENTIFIC BIG DATA: sequencing, phenotyping and image data

•  Large-scale  •  Complex  •  Heterogeneous

Data Analysis and Workflows

Programs for High-Throughput Sequencing

Indexing, Mediation

Recommendation and Retrieval of Complex Contents

P2P, Cloud, Multi-Site, HPC

Future

•  Fusion between projects SePhHaDe and Credible
•  New graphs, indexes and algorithms for genome assembly
•  Metagenomic pipeline
•  Recommendation and plant identification for LifeCLEF
•  New edition of the workshop "Data Structures in Bioinformatics"

http://seqahead.cs.tu-dortmund.de/meetings:dsb

Publications

•  Drezen et al., GATB: Genome Assembly & Analysis Tool Box, Bioinformatics, 2014
•  Salmela and Rivals, LoRDEC: accurate and efficient long read error correction, Bioinformatics, 2014
•  Noe and Martin, A coverage criterion for spaced seeds and its applications to SVM string kernels and k-mer distances, J. Computational Biology, 2014
•  Frith and Noe, Improved search heuristics find 20 000 new alignments between human and mouse genomes, Nucleic Acids Research, 2014
•  Cazaux et al., From Indexing Data Structures to de Bruijn Graphs, CPM, 2014
•  Blanc-Mathieu et al., An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina read de novo assemblies, BMC Genomics, 2014
•  Servajean et al., Profile Diversity for Query Processing using User Recommendations, Information Systems, 2015
•  Joly et al., Are Species Identification Tools Biodiversity-friendly?, ACM IW on Multimedia Analysis for Ecological Data, 2014
•  Joly et al., LifeCLEF 2014: multimedia life species identification challenges, Information Access Evaluation, 2014
•  Cazaux and Rivals, Reverse Engineering of Compact Suffix Trees and Links, J. Discrete Algorithms, 2014

Partners

Thank you for your attention
