GATB programming day - Inria - gatb-core.gforge.inria.fr/training/gatb-tutorial.pdf


GATB programming day

G.Rizk, R.Chikhi

Genscale, Rennes

15/06/2016

G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 1 / 41


>read 1
ACGACGACGTAGACGACTAGCAAAACTACGATCGACTAT
>read 2
ACTACTACGATCGATGGTCGCGCTGCTCGCTCTCTCGCT...
>read 10.000.000
TCTCCTAGCGCGGCGTATACGCTCGCTAGCTACGTAGCT...

➢ NGS technologies produce terabytes of data

➢ Efficient and fast algorithms are essential to analyze such data

➢ The Genome Assembly Tool Box (GATB)

1. Open-source software developed by GENSCALE

2. Easy way to develop efficient and fast NGS tools

3. Based on data structure with a very low memory footprint

4. Complex genomes can be processed on desktop computers

INTRODUCTION


➢ The GATB philosophy: a 3-layer construction to analyze NGS datasets

INTRODUCTION

[Figure: the three layers GATB-CORE, GATB-TOOLS and GATB-PIPELINE, plus third parties]

1. GATB-CORE: a C++ library holding all the services needed for developing software dedicated to NGS data

2. GATB-TOOLS: a set of elementary NGS tools mainly built upon the GATB library (k-mer counter, contiger, scaffolder, variant detection, etc.)

3. GATB-PIPELINE: a set of NGS pipelines that link together tools from the previous layer


GATB-CORE structure for NGS data

➢ Compact de Bruijn graph

➢ Encodes most of the information from the sequencing reads

➢ Graph with low memory footprint

● Complex genomes can be processed on a desktop computer

● A whole human genome handled with 6 GBytes

● The raspberry genome can be assembled on a Raspberry Pi

INTRODUCTION

[Figure: example de Bruijn graph whose nodes are the 3-mers CGC, GCT, CCG, TCC, CTA, TAT, ATC, CGA, AGC, ATT, GAG, AAA, GGA, TGG, TTG]


>read 1
ACGACGACGTAGTAAACTACGATCGACTAT

ACGACGACGTA   kmer 1
CGACGACGTAG   kmer 2
GACGACGTAGT   kmer 3
ACGACGTAGTA   kmer 4
...
ACGATCGACTA   kmer 19
CGATCGACTAT   kmer 20

➢ Each read is split into words named kmers

➢ A kmer has a fixed size K

➢ Example for K=11

FROM READS TO DE BRUIJN GRAPH
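A minimal sketch of this splitting step in plain C++, without GATB. The function name splitIntoKmers is made up for this illustration, and it works on strings for readability, whereas GATB works on a compact binary encoding of kmers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return every k-mer of `read`, ordered by starting position.
// A read of length L yields L - k + 1 k-mers (none if L < k).
std::vector<std::string> splitIntoKmers(const std::string& read, std::size_t k)
{
    assert(k > 0);
    std::vector<std::string> kmers;
    for (std::size_t i = 0; i + k <= read.size(); ++i)
    {
        kmers.push_back(read.substr(i, k));
    }
    return kmers;
}
```

Called on the 30-nucleotide read ACGACGACGTAGTAAACTACGATCGACTAT with K=11, it returns 30 - 11 + 1 = 20 kmers, from ACGACGACGTA to CGATCGACTAT.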


➢ Count kmers and keep solid kmers (i.e. present at least N times)

➢ Each solid kmer is inserted as a node of a de Bruijn graph

➢ Nodes A,B are connected <=> suffix(A,K-1) = prefix(B,K-1)

➢ Two reads that overlap by at least K-1 nucleotides will be connected in the de Bruijn graph

FROM READS TO DE BRUIJN GRAPH

>read 1 ACGACGACGTAGTAAACTACGATCGACTAT

>read 2 CTACGATCGACTATTAGTGATGATAGATAGAT

Kmers specific to read 1

Kmers common to read 1 and read 2

Kmers specific to read 2
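The counting step and the edge rule can be sketched in plain C++ as follows. This is only an illustration under simplifying assumptions: kmers are stored as strings in a hash map, reverse complements are ignored, and the names solidKmers and isSuccessor are hypothetical (they are not GATB functions).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Count all k-mers of the reads and keep those seen at least minAbundance times.
std::unordered_set<std::string> solidKmers(const std::vector<std::string>& reads,
                                           std::size_t k, unsigned minAbundance)
{
    assert(k > 0);
    std::unordered_map<std::string, unsigned> counts;
    for (const std::string& read : reads)
    {
        for (std::size_t i = 0; i + k <= read.size(); ++i) { ++counts[read.substr(i, k)]; }
    }

    std::unordered_set<std::string> solid;
    for (const auto& entry : counts)
    {
        if (entry.second >= minAbundance) { solid.insert(entry.first); }
    }
    return solid;
}

// Edge A -> B exists when the last K-1 characters of A equal the first K-1 of B.
bool isSuccessor(const std::string& a, const std::string& b)
{
    assert(a.size() == b.size() && a.size() >= 2);
    const std::size_t overlap = a.size() - 1;
    return a.compare(1, overlap, b, 0, overlap) == 0;
}
```

With the two reads of this slide, K=11 and a threshold of 2, only the four kmers shared by both reads are solid (CTACGATCGAC, TACGATCGACT, ACGATCGACTA, CGATCGACTAT), and each one is the successor of the previous one.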

➢ In a nutshell...

1. Split the reads into kmers

2. Count kmers and keep the solid ones

3. Insert the solid kmers into a de Bruijn graph

>read 1
ACGACGACGTAGACGACTAGCTAGCAATGCTAGCTAGGATCAAAACTACGATCGACTAT
>read 2
ACTACTACGATCGATGGTCGAGGGCGAGCTAGCTAGCTGACGCTGCTCGCTCTCTCGCT
...
>read 10.000.000
TCTCCTAGCGCGGCGTATACGCGCTAAGCTAGCTCTCGCTGCTCGCTAGCTACGTAGCT
...

FROM READS TO DE BRUIJN GRAPH


➢ Two reasons for a very low memory footprint:

➢ GATB-CORE stores only the nodes of the de Bruijn graph; the edges are computed on the fly when needed.

➢ The nodes are stored in a Bloom filter, a space-efficient structure

FROM READS TO DE BRUIJN GRAPH
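To make the Bloom filter idea concrete, here is a small self-contained sketch. It is not GATB's implementation: the class, its parameters and the hashing scheme (double hashing built on std::hash) are assumptions chosen for brevity. It only shows the property that matters here: an inserted kmer is always found, while a kmer that was never inserted is usually, but not always, rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter
{
public:
    BloomFilter(std::size_t nbBits, unsigned nbHashes)
        : bits_(nbBits, false), nbHashes_(nbHashes)
    {
        assert(nbBits > 0 && nbHashes > 0);
    }

    // Set the nbHashes bits associated with the kmer.
    void insert(const std::string& kmer)
    {
        for (unsigned i = 0; i < nbHashes_; ++i) { bits_[position(kmer, i)] = true; }
    }

    // false: the kmer was never inserted. true: inserted, or a false positive.
    bool contains(const std::string& kmer) const
    {
        for (unsigned i = 0; i < nbHashes_; ++i)
        {
            if (!bits_[position(kmer, i)]) { return false; }
        }
        return true;
    }

private:
    // Hypothetical hashing scheme: i-th position derived from two base hashes.
    std::size_t position(const std::string& kmer, unsigned i) const
    {
        const std::size_t h1 = std::hash<std::string>{}(kmer);
        const std::size_t h2 = std::hash<std::string>{}(kmer + "#") | 1u;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned nbHashes_;
};
```

The memory used depends on the number of bits chosen, not on the length of the kmers, which is where the space saving comes from.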


➢ So, GATB-CORE transforms a set of reads into a compact de Bruijn graph

➢ The graph is saved in HDF5 format

➢ Such a graph can be used by tools developed with the GATB-CORE C++ Library

[Figure: Reads (FASTA) → GATB-CORE (binaries, library API) → Graph (HDF5), used by Tool 1, Tool 2 and Tool 3 of GATB-TOOLS]

FROM READS TO DE BRUIJN GRAPH


➢ GATB-CORE C++ library

● Fast development of efficient tools

● Easy and fast learning curve (full documentation with numerous code samples)

➢ From the developer's point of view

● no need to bother with the de Bruijn graph construction

● focus on your own algorithms

➢ GATB-CORE high-level API

THE GATB-CORE API


➢ The system package holds all the operations related to the operating system: file management, memory management and thread management.

➢ The tools package offers generic operations used throughout user application code, but not specific to genomics.

➢ The bank package provides operations related to standard genomic sequence dataset management. Using this package lets you write algorithms independently of the input format.

➢ The kmer package is dedicated to fine-grained manipulation of k-mers.

➢ The debruijn package provides high-level functions to manipulate the de Bruijn graph data structure.

THE GATB-CORE API


➢ Details on the graph API (class Graph)

● Iterate all the nodes of a graph

● Iterate branching nodes of a graph

● Get the neighboring nodes/edges of a node

● Get the in/out degree of a node

● Depth First Search from a node

● Breadth First Search from a node

➢ Two use cases of de Bruijn graph navigation

● Assembly

● Detection of patterns in the graph

THE GATB-CORE API


GATB-core coding session

Basic input/output

15 min break

Kmerization / graph API

Lunch

Mini tool: error correction.


Documentation

Open http://gatb-core.gforge.inria.fr

"How to use the library?": many code examples.

Use the search field at the top right to find the documentation of a class.

Sources are in the examples/ directory of the gatb-core library.

General design of the library, documentation of the API.


GATB input/output

GATB can:

read files in fasta or fastq format.

read gzipped files directly.

no need to tell which format it is.

read a list of files.

write fasta/fastq files.


Introduction: code io1.cpp

    #include <gatb/gatb_core.hpp>

    int main (int argc, char* argv[])
    {
        // Open the bank given on the command line
        IBank* inbank = Bank::open (argv[1]);

        // Iterate over all the sequences of the bank
        Iterator<Sequence>* it = inbank->iterator ();
        for (it->first (); !it->isDone (); it->next ())
        {
            // Shortcut
            Sequence& seq = it->item ();

            // We dump the data size and the comment
            std::cout << "[" << seq.getDataSize () << "]" << seq.getComment () << std::endl;

            // We dump the data
            std::cout << seq.toString () << std::endl;
        }
    }


First project

Go to gatb/devel/samples

Compile the first example with: make io1

Try it with: ./io1 read.fastq


IO: Task 1, read a file

Open the GATB documentation page http://gatb-core.gforge.inria.fr

Have a look at the IBank interface (type IBank in the search field).

Scroll down to find the methods estimateNbItems and estimateSequencesSize.

Modify the code to print some bank information.

It should look like this:

    int64_t   nbseq     = inbank->estimateNbItems ();
    u_int64_t totalsize = inbank->estimateSequencesSize ();
    std::cout << "nbseq : " << nbseq << " totalsize " << totalsize << std::endl;


IO: Task 2, write a file

Create a new bank with the BankFasta constructor:

    IBank* BankFasta (const std::string& filename, bool output_fastq=false, bool output_gz=false)

Insert sequences into this bank with the insert method.

Do not forget to flush the bank at the end with flush().

=⇒ You have built a fastq/fasta converter.


IO: Task 2, solution

    #include <gatb/gatb_core.hpp>

    int main (int argc, char* argv[])
    {
        IBank* inbank  = Bank::open (argv[1]);
        IBank* outBank = new BankFasta ("outbank");

        Iterator<Sequence>* it = inbank->iterator ();
        for (it->first (); !it->isDone (); it->next ())
        {
            // Shortcut
            Sequence& seq = it->item ();

            // Copy the sequence to the output bank
            outBank->insert (seq);
        }

        // Write the buffered sequences to the output file
        outBank->flush ();
    }


IO: Task 2

The BankFasta constructor has options to output in fastq or gz, try this out!

The input bank can be any of fasta, fastq, fasta.gz, fastq.gz, or a list of files.

With a comma-separated list:

    ./io1 read.fasta,file2.fastq,file3.fasta.gz

Or in a file of files, one filename per line:

    > cat list_of_files
    read.fasta
    file2.fasta
    > ./io1 list_of_files

In that case the iterator on IBank runs over the concatenation of the banks.


IO: Task 3

This template can be used to filter/transform sequences.

Explore the Sequence API to trim 20 nt at the end of the sequence, and filter out sequences that have more than 20 'N'.

On a larger file, a progress bar would be nice! Go to the bank5.cpp code snippet (in the GATB doc) to get an example.

Hint: use seq.setDataRef (&seq.getData(), ...) to trim and seq.getDataBuffer() to count N's


IO: Task 3, solution

    #include <gatb/gatb_core.hpp>

    int main (int argc, char* argv[])
    {
        IBank* inbank  = Bank::open (argv[1]);
        IBank* outBank = new BankFasta ("outbank");

        // Same iteration as before, with a progress bar
        ProgressIterator<Sequence>* it = new ProgressIterator<Sequence> (*inbank, "Iterating sequences");

        for (it->first (); !it->isDone (); it->next ())
        {
            Sequence& seq = it->item ();

            // Skip reads too short to be trimmed (avoids an unsigned underflow below)
            if (seq.getDataSize () <= 20)  { continue; }

            // Trim 20 nt at the end
            seq.setDataRef (&seq.getData (), 0, seq.getDataSize () - 20);

            // Count the N's of the trimmed sequence
            int nbN = 0;
            char* data = seq.getDataBuffer ();
            for (size_t i = 0; i < seq.getDataSize (); i++)
            {
                if (data[i] == 'N')  { nbN++; }
            }

            // Keep the sequence unless it has more than 20 N's
            if (nbN <= 20)  { outBank->insert (seq); }
        }

        outBank->flush ();
    }


Kmerization

Everything in GATB is kmer based! −→ API for kmer extraction.

Open file kmer1.cpp

    // We declare a kmer model
    Kmer<span>::ModelDirect model (kmerSize);

    // and a kmer iterator
    Kmer<span>::ModelDirect::Iterator itKmer (model);

Compile: make kmer1

Then launch: ./kmer1 ../data/read3.fastq

Observe the output.


Canonical kmer

Reads can come from the forward or the reverse strand, and we do not know which: each kmer may appear in two forms.

A widespread approach is to keep only one of the two: the kmer or its reverse complement.

By convention, the minimum (in lexicographic order) of a kmer and its reverse complement is the canonical kmer.

Accessible through ModelCanonical API.
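The convention can be illustrated with a few lines of plain C++ on strings. The function names below are made up for this sketch. Note that ModelCanonical operates on encoded kmers, so the strand it selects may differ from the ASCII comparison used here; this is an assumption worth checking in the documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Complement of one nucleotide; anything unexpected becomes 'N'.
char complement(char nt)
{
    switch (nt)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Reverse the kmer, then complement every nucleotide.
std::string reverseComplement(const std::string& kmer)
{
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Canonical form: the smaller of the kmer and its reverse complement.
std::string canonical(const std::string& kmer)
{
    return std::min(kmer, reverseComplement(kmer));
}
```

A kmer and its reverse complement always map to the same canonical form, for instance both AAAC and GTTT give AAAC, so the two strands are counted together.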


Kmerization

Replace ModelDirect with ModelCanonical

See the doc for KmerCanonical.

Try itKmer->value(), itKmer->forward() and itKmer->revcomp()


Parallelization

GATB-CORE provides a way to easily parallelize iterations.

Concept of a Dispatcher that takes an iterator as input and creates N threads; each thread receives and processes iterated items from the iterator through a functor object.

We need to modify the previous code a little bit.
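The dispatcher concept can be sketched with the C++ standard library alone. The code below is not GATB's Dispatcher class: dispatch and LengthCounter are hypothetical names, and a std::vector stands in for the iterator. It only shows the pattern: N threads pull items from a shared source and hand each one to a functor, which must therefore be safe to call concurrently.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a dispatcher: nbThreads workers take items one at
// a time from a shared source and pass each item to the functor.
template <typename Item, typename Functor>
void dispatch(const std::vector<Item>& items, std::size_t nbThreads, Functor functor)
{
    assert(nbThreads > 0);
    std::size_t next = 0;      // index of the next item to hand out
    std::mutex sourceLock;     // protects `next`

    auto worker = [&]()
    {
        for (;;)
        {
            std::size_t mine;
            {
                std::lock_guard<std::mutex> guard(sourceLock);
                if (next >= items.size()) { return; }
                mine = next++;
            }
            functor(items[mine]);   // runs concurrently with the other workers
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < nbThreads; ++t) { threads.emplace_back(worker); }
    for (std::thread& th : threads) { th.join(); }
}

// Example functor: adds up sequence lengths; the atomic makes it thread-safe.
struct LengthCounter
{
    std::atomic<std::size_t>& total;
    void operator()(const std::string& seq) const { total += seq.size(); }
};
```

Any state shared between threads, such as the total above, needs an atomic or a lock; per-thread state can instead live in functor copies, which avoids contention.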


Graph creation

From the command line:

Create the graph with: ./dbgh5 -in read.fasta

Run ./dbgh5 to get the option list.

Use dbginfo -in graph.h5 to print information.

Within code:

Graph::create or Graph::load.

Print graph information with:

    cout << graph.getInfo ();


Mini tool: read correction

Remember, the graph nodes are the set of solid kmers.

Solid kmers are considered error free.

The contains method queries the graph for a specific kmer.

In this example, the graph is only used as the set of solid kmers (to test membership); edges are not used here.

Warning: the GATB structure answers set membership queries exactly only for nodes in the graph. For other kmers, the structure is probabilistic (but this is what makes it great) → there might be false positives.


Read correction: theory


Read correction: in the .tar file

data/ : example data

solution/ : solutions

samples/ : your working directory

GATB-core is already installed in the VM.


Read correction: task 1

Open minicorrec1.cpp

See documentation for Graph

Create a Graph with:

    Graph graph = Graph::create ("-in %s -abundance-min %d -kmer-size %d -debloom original", argv[1], 3, kmerSize);

Iterate over the canonical kmers of the reads.


Read correction: task 2

Query the solidity of each kmer.

Print read profiles in the form of:

    111111100000001111111

Print a "1" for a solid kmer, a "0" for a weak kmer.

Hint:

Build a graph node with

    // build node from a kmer
    Node node (Node::Value (itKmer->value ()));

Query node solidity with graph.contains (node)


Read correction, task 2 solution

    Kmer<span>::ModelCanonical model (kmerSize);
    Kmer<span>::ModelCanonical::Iterator itKmer (model);

    // We loop over sequences.
    for (it->first (); !it->isDone (); it->next ())
    {
        itKmer.setData ((*it)->getData ());

        // loop over kmers of the sequence
        for (itKmer.first (); !itKmer.isDone (); itKmer.next ())
        {
            // build node from a kmer
            Node node (Node::Value (itKmer->value ()));

            if (graph.contains (node))  { printf ("1"); }
            else                        { printf ("0"); }
        }
        printf ("\n");
    }


Read correction: task 2

Sample output:

TGGCTGCCATCAGGAAGAGTTATAACAGGCATTTTATATCCTTATTTGCAGTGGTGACCCACACGAAAGATCACATACAAAGAAAAATTTGTTTATTAAC
1111111111111111111111111111111111111111111111111111111111111111111111

GGCACCAATAATTTGCCCATCCACAACAACCGGTACGCCGCCTTCCAGCGACGTTAATTTCGGCGCAGTCACGAACGCGGTACGTCCGTTGTTCACCATC
1111111111111111111111111111000000000000000000000000000000001111111111

CAGATCAATGTGCCGGAAGATAACGACATGGTGCTCGCAATGTATGAACGCCTGGGATATGAACACGCCGACGAGCTGAGTCTGGGTAAGCGTTTGATTG
1111111001000000000000000000000000001011111111111111111111111100000000

We'll focus on simple cases: an isolated error generates a 'gap' of k weak kmers.
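The profile and the gap detection can be prototyped without GATB. In this sketch a std::unordered_set of strings plays the role of the graph (an assumption made to keep the example self-contained), and the function names are hypothetical. An error at position p of the read touches the k kmers starting at p-k+1 ... p, so a gap of exactly k zeros starting at kmer index s points to position s+k-1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Build the 0/1 solidity profile of a read: one character per k-mer.
std::string solidityProfile(const std::string& read, std::size_t k,
                            const std::unordered_set<std::string>& solid)
{
    assert(k > 0);
    std::string profile;
    for (std::size_t i = 0; i + k <= read.size(); ++i)
    {
        profile += solid.count(read.substr(i, k)) ? '1' : '0';
    }
    return profile;
}

// Return the read position of an isolated error, i.e. the first run of exactly
// k zeros with a '1' on both sides, or std::string::npos if there is none.
std::size_t isolatedErrorPosition(const std::string& profile, std::size_t k)
{
    std::size_t i = 0;
    while (i < profile.size())
    {
        if (profile[i] == '1') { ++i; continue; }
        const std::size_t start = i;
        while (i < profile.size() && profile[i] == '0') { ++i; }
        const bool bordered = start > 0 && i < profile.size();
        if (bordered && i - start == k) { return start + k - 1; }
    }
    return std::string::npos;
}
```

Gaps shorter or longer than k, or gaps touching either end of the read, are deliberately ignored here; they correspond to the harder cases (close errors, errors near the read ends).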


Read correction: task 3

Start from solution/minicorrec2.cpp if lost :)

Detect stretches of k non-solid kmers.

Infer the position of the error.

Change the nucleotide at this position, build a kmer and query it in the graph, to find which nucleotide is the correct one.

Hint:

You can get the sequence as a char* with seq.getDataBuffer();

Build a kmer from a char* with model.codeSeed()

Build a Node from a kmer with: Node (Node::Value (putative_corrected_kmer.value ()))

You may start from the easier file minicorrec2_easier.cpp
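One possible shape for the substitution step, again sketched without GATB: a std::unordered_set of strings stands in for the graph, and correctAt is a hypothetical name. This variant is stricter than querying a single kmer, since it accepts a candidate nucleotide only if every kmer covering the position becomes solid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Try the three alternative nucleotides at `pos`. Accept the first one that
// makes every k-mer covering `pos` solid. Returns true if the read was fixed.
bool correctAt(std::string& read, std::size_t pos, std::size_t k,
               const std::unordered_set<std::string>& solid)
{
    assert(k > 0 && read.size() >= k && pos < read.size());
    const char original = read[pos];

    // The k-mers covering `pos` start at indices first ... last.
    const std::size_t first = (pos + 1 >= k) ? pos + 1 - k : 0;
    const std::size_t last  = std::min(pos, read.size() - k);

    for (char candidate : std::string("ACGT"))
    {
        if (candidate == original) { continue; }
        read[pos] = candidate;

        bool allSolid = true;
        for (std::size_t i = first; i <= last && allSolid; ++i)
        {
            allSolid = solid.count(read.substr(i, k)) > 0;
        }
        if (allSolid) { return true; }
    }

    read[pos] = original;   // no candidate worked: leave the read unchanged
    return false;
}
```

With a probabilistic structure such as the graph, a wrong candidate can occasionally pass one membership test by chance; requiring several overlapping kmers to agree makes that much less likely.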


Read correction

Going further:

correction of close SNPs, SNPs near the beginning/end of a read.

validate correction with multiple overlapping kmers

many algorithms / heuristics are possible.

Bloocoo (G.Benoit, G.Rizk, C.Lemaitre) --> short read correction

Lordec (L. Salmela, E. Rivals) --> hybrid PacBio correction
