GATB programming day - Inriagatb-core.gforge.inria.fr/training/gatb-tutorial.pdf · IO: Task 1,...
Transcript of GATB programming day - Inriagatb-core.gforge.inria.fr/training/gatb-tutorial.pdf · IO: Task 1,...
GATB programming day
G.Rizk, R.Chikhi
Genscale, Rennes
15/06/2016
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 1 / 41
GATB
>read 1ACGACGACGTAGACGACTAGCAAAACTACGATCGACTAT>read 2ACTACTACGATCGATGGTCGCGCTGCTCGCTCTCTCGCT...>read 10.000.000TCTCCTAGCGCGGCGTATACGCTCGCTAGCTACGTAGCT...
➢ NGS technologies produce terabytes of data
➢ Efficient and fast algorithms are essential to analyze such data
➢ The Genome Assembly Tool Box (GATB)
1. Open-source software developed by GENSCALE
2. Easy way to develop efficient and fast NGS tools
3. Based on data structure with a very low memory footprint
4. Complex genomes can be processed on desktop computers
INTRODUCTIONINTRODUCTION
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 2 / 41
GATB
➢ The GATB philosophy: a 3-layer construction to analyze NGS datasets
INTRODUCTIONINTRODUCTION
GATB-CORE
GATB-TOOLS
GATB-PIPELINETHIRD
PARTIES
1. GATB-CORE: a C++ library holding all
the services needed for developing
software dedicated to NGS data
2. GATB-TOOLS: a set of elementary
NGS tools mainly built upon the GATB
library (k-mer counter, contiger,
scaffolder, variant detection, etc.)
3. GATB-PIPELINE: a set of NGS pipeline that links together
tools from the previous layer
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 3 / 41
GATB
GATB-CORE structure for NGS data
➢ Compact de Bruijn graph
➢ Encodes most of the information from the sequencing reads
➢ Graph with low memory footprint
● Complex genomes can be processed on a desktop computer
● A whole human genome handled with 6 GBytes
● The raspberry genome can be assembled on a Raspberry Pi
INTRODUCTIONINTRODUCTION
CGC
GCT
CCG
TCC
CTA
TAT
ATC
CGA
AGC
ATT
GAG
AAA
GGA
TGG
TTG
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 4 / 41
GATB
>read 1 ACGACGACGTAGTAAACTACGATCGACTAT
ACGACGACGTA kmer 1 CGACGACGTAG kmer 2 GACGACGTAGT kmer 3 CGACGTAGTA kmer 4 ... ... ACGATCGACTA kmer 18 CGATCGACTAT kmer 19
➢ Each read is split into words named kmers
➢ A kmer has a fixed size K
➢ Example for K=11
FROM READS TO DE BRUIJN GRAPHFROM READS TO DE BRUIJN GRAPH
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 5 / 41
GATB
➢ Count kmers and keep solid kmers (i.e. present at least N times)
➢ Each solid kmer is inserted as a node of a de Bruijn graph
➢ Nodes A,B are connected <=> suffix(A,K-1) = prefix(B,K-1)
➢ Two reads sharing (K-1) overlapping will be connected in the de Bruijn graph
FROM READS TO DE BRUIJN GRAPHFROM READS TO DE BRUIJN GRAPH
>read 1 ACGACGACGTAGTAAACTACGATCGACTAT
>read 2 CTACGATCGACTATTAGTGATGATAGATAGAT
Kmers specific to read 1
Kmers common to read 1 and read 2
Kmers specific to read 2
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 6 / 41
GATB➢ In a nutshell...
1. Split the reads into kmers
2. Count kmers and keep the solid ones
3. Insert the solid kmers into a de Bruijn graph
>read 1ACGACGACGTAGACGACTAGCTAGCAATGCTAGCTAGGATCAAAACTACGATCGACTAT
>read 2ACTACTACGATCGATGGTCGAGGGCGAGCTAGCTAGCTGACGCTGCTCGCTCTCTCGCT
...
>read 10.000.000TCTCCTAGCGCGGCGTATACGCGCTAAGCTAGCTCTCGCTGCTCGCTAGCTACGTAGCT
...
FROM READS TO DE BRUIJN GRAPHFROM READS TO DE BRUIJN GRAPH
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 7 / 41
GATB
➢ GATB-CORE only stores nodes of the de Bruijn graph, the edges can be computed on the fly when needed.
➢ The nodes are stored in a Bloom filter, a space-efficient structure
➢ Two reasons for a very low memory footprint
FROM READS TO DE BRUIJN GRAPHFROM READS TO DE BRUIJN GRAPH
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 8 / 41
GATB
➢ So, GATB-CORE transforms a set of reads into a compact de Bruijn graph
➢ The graph is saved in HDF5 format
➢ Such a graph can be used by tools developed with the GATB-CORE C++ Library
Gra
ph
(
HD
F5
)
Re
ad
s (
FA
ST
A)
Library
API
Binaries
GATB-CORE
Tool 2
Tool 1
Tool 3
GATB-TOOLS>14_G1511837AGTCGGCTAGCATAGTGCTCAGGAGCTTAAACATGCATGAGAG
>14_G1517637ATCGACTTCTCTTCTTTTCGAGCTTAGCTAATCA
FROM READS TO DE BRUIJN GRAPHFROM READS TO DE BRUIJN GRAPH
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 9 / 41
GATB
➢ GATB-CORE C++ library
● Fast development of efficient tools
● Easy and fast learning curve (full documentation with numerous code samples)
➢ From developers point of view
● don't have to bother with the de Bruijn graph construction
● focus on their own algorithms
➢ GATB-CORE high level API
THE GATB-CORE APITHE GATB-CORE API
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 10 / 41
GATB
➢ The system package holds all the operations related to the operating system: file management, memory management and thread management.
➢ The tools package offers generic operations used throughout user applicative code, but not specific to genomic area.
➢ The bank package provides operations related to standard genomic sequence dataset management. Using this package allows to write algorithms independently of the input format.
➢ The kmer package is dedicated to fine-grained manipulation of k-mers.
➢ The debruijn package provides high-level functions to manipulate the de Bruijn graph data structure
THE GATB-CORE APITHE GATB-CORE API
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 11 / 41
GATB
➢ Details on the graph API (class Graph)
● Iterate all the nodes of a graph
● Iterate branching nodes of a graph
● Get neighbors nodes/edges from a node
● Get in/out degree from a node
● Depth First Search from a node
● Breadth First Search from a node
➢ Two use cases of de Bruijn graph navigation
● Assembly
● Detection of patterns in the graph
THE GATB-CORE APITHE GATB-CORE API
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 12 / 41
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 13 / 41
GATB-core Coding session
Basic input/output15 min break
Kmerization / graph APILunch
Mini tool: error correction.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 14 / 41
Documentation
Open http://gatb-core.gforge.inria.fr
”How to use the library?” : many code examples.
Use top right search field to find doc for a class.
Sources in the gatb-core library examples/
General design of library, documentation of API.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 15 / 41
GATB input/output
GATB can :
read files in fasta or fastq format.
read gzipped files directly.
no need to tell which format it is.
read list of files.
write fasta/fastq files.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 16 / 41
Introduction : code io1.cpp
#include <gatb/gatb_core.hpp >
int main (int argc , char* argv [])
{
IBank * inbank = Bank::open(argv [1]);
Iterator <Sequence >* it = inbank ->iterator ();
for (it->first (); !it ->isDone (); it->next())
{
// Shortcut
Sequence& seq = it->item();
// We dump the data size and the comment
std::cout << "[" << seq.getDataSize () << "]" << seq.
getComment () << std::endl;
// We dump the data
std::cout << seq.toString () << std::endl;
}
}
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 17 / 41
First project
Go to gatb/devel/samples
Compile first example with:make io1
Try with ./io1 read.fastq
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 18 / 41
IO: Task 1, read a fileOpen the GATB documentation pagehttp://gatb-core.gforge.inria.fr
Have a look at the IBank interface (type IBank inthe search field).
Scroll down to find methods estimateNbItems andestimateSequencesSize.
Modify the code to print some bank information.
Should look like this :
int64_t nbseq = inbank ->estimateNbItems ();
u_int64_t totalsize = inbank ->estimateSequencesSize ();
std::cout << "nbseq : " << nbseq << " totalsize " << totalsize
<< std::endl;
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 19 / 41
IO: Task 1, read a fileOpen the GATB documentation pagehttp://gatb-core.gforge.inria.fr
Have a look at the IBank interface (type IBank inthe search field).
Scroll down to find methods estimateNbItems andestimateSequencesSize.
Modify the code to print some bank information.
Should look like this :int64_t nbseq = inbank ->estimateNbItems ();
u_int64_t totalsize = inbank ->estimateSequencesSize ();
std::cout << "nbseq : " << nbseq << " totalsize " << totalsize
<< std::endl;
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 19 / 41
IO: Task 2, write a file
Create a new Bank with new BankFasta constructor:IBank * BankFasta (const std:: string &filename , bool
output_fastq=false , bool output_gz=false)
Insert sequences into this bank with insert method.
Do not forget to flush the bank at the end withflush().
=⇒ You built a fastq/fasta convertor.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 20 / 41
IO: Task 2, solution
#include <gatb/gatb_core.hpp >
int main (int argc , char* argv [])
{
IBank * inbank = Bank::open(argv [1]);
IBank * outBank = new BankFasta ("outbank");
Iterator <Sequence >* it = inbank ->iterator ();
for (it->first (); !it ->isDone (); it->next())
{
// Shortcut
Sequence& seq = it->item();
outBank ->insert (seq);
}
outBank ->flush();
}
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 21 / 41
IO: Task 2The BankFasta constructor has options to output infastq or gz, try this out !
The input bank can be any of fasta, fastq, fasta.gz,fastq.gz, or a list of files.
With comma separated list:./io1 read.fasta ,file2.fastq ,file3.fasta.gz
Or in a file of files, one filename per line:> echo list_of_files
read.fasta
file2.fasta
> ./io1 list_of_files
In that case the iterator on IBank is theconcatenation of banks.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 22 / 41
IO: Task 3
This template can be used to filter/transformsequences.
Explore the Sequence API to trim 20 nt at the endof the sequence, and filter out sequences that havemore than 20 ’N’
On a larger file, a progress bar would be nice ! Goto bank5.cpp code snippet (in the gatb doc) to getan example.
Hint: useseq.setDataRef( & seq.getData(),...) totrim and seq.getDataBuffer() to count N’s
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 23 / 41
IO: Task 3, solution#include <gatb/gatb_core.hpp >
int main (int argc , char* argv [])
{
IBank * inbank = Bank::open(argv [1]);
IBank* outBank = new BankFasta ("outbank");
ProgressIterator <Sequence > *it = new ProgressIterator <Sequence >
(*inbank , "Iterating sequences");
for (it->first (); !it ->isDone (); it->next())
{
Sequence& seq = it->item();
// trim 20 nt at the end
seq.setDataRef( & seq.getData () ,0,seq.getDataSize () -20 );
int nbN = 0; char* data = seq.getDataBuffer ();
for (size_t i=0; i<seq.getDataSize (); i++)
{
if (data[i]==’N’ ) { nbN++; }
}
if(nbN <20)
outBank ->insert (seq);
}
outBank ->flush();
}
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 24 / 41
Kmerization
Everything in GATB is kmer based !−→ API for kmer extraction .
Open File kmer1.cpp// We declare a kmer model
Kmer <span >:: ModelDirect model (kmerSize);
// and a kmer iterator
Kmer <span >:: ModelDirect :: Iterator itKmer (model);
compile:make kmer1
then launch:./kmer1 ../data/read3.fastq
Observe output.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 25 / 41
Canonical kmer
Reads can be on forward or reverse strand, we donot know: each kmer may appear twice.
A widespread concept is to keep only the forward orreverse complement of a kmer.
By convention, the minimum (in lexic. order)between a kmer and its reverse complement is theCanonical kmer
Accessible through ModelCanonical API.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 26 / 41
Kmerization
Replace ModelDirect with ModelCanonical
See doc KmerCanonical.
Try itKmer-> value(), forward(), revcomp()
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 27 / 41
Parallelization
GATB-CORE provides a way to easily parallelizeiterations
Concept of Dispatcher that takes an iterator asinput, creates N thread ; each thread receives andprocesses iterated items from the iterator through afunctor object.
We need to modify previous code a little bit.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 28 / 41
Parallelization
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 29 / 41
Parallelization
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 30 / 41
Graph creation
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 31 / 41
Graph creation
From command line:
Graph created with command ./dbgh5 -in
read.fasta
./dbgh5 to get option list.
Use dbginfo -in graph.h5 to print information.
Within code:
Graph::create or Graph::load.
Print graph information with:
cout << graph.getInfo ();
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 32 / 41
minitool: Read correctionRemember, graph nodes is the set of solid kmers.
solid kmers are error free.
Contains method to query graph for a specifickmer.
In this example, the graph is only used as the set ofsolid kmers (to test membership), edges will not beused here.
Warning: GATB structure answers set membershipqueries exactly only for nodes in the graph. Forother kmers, the structure is probabilistic (but thisis what makes it great) → there might be falsepositives.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 33 / 41
Read correction: theory
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 34 / 41
Read correction: in the .tar file
data/ : example data
solution/ : solutions
samples/ : your working directory
GATB-core is already installed in the VM.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 35 / 41
Read correction: task 1
Open minicorrec1.cpp
See documentation for Graph
create a Graph withGraph graph = Graph:: create (" -in %s -abundance -min %d -
kmer -size %d -debloom original",argv[1], 3,kmerSize);
Iterate over canonical kmers of the reads.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 36 / 41
Read correction: task 2
Query for each kmer its solidity.
print read profiles in the form of:111111100000001111111Print a ”1” for a solid kmer, a 0 for weak kmer
Hint:
Build a graph node with
// build node from a kmer
Node node(Node::Value(itKmer ->value ()));
Query node solidity with graph.contains(node)
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 37 / 41
Read correction, task 2 solutionKmer <span >:: ModelCanonical model (kmerSize);
Kmer <span >:: ModelCanonical :: Iterator itKmer (model);
// We loop over sequences.
for (it->first (); !it ->isDone (); it->next())
{
itKmer.setData ((*it)->getData ());
// loop over kmers of the sequence
for (itKmer.first(); !itKmer.isDone (); itKmer.next())
{
// build node from a kmer
Node node(Node::Value(itKmer ->value ()));
if(graph.contains(node))
{
printf("1");
}
else
{
printf("0");
}
}
printf("\n");
}
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 38 / 41
Read correction: task 2
Sample output:
TGGCTGCCATCAGGAAGAGTTATAACAGGCATTTTATATCCTTATTTGCAGTGGTGACCCACACGAAAGATCACATACAAAGAAAAATTTGTTTATTAAC1111111111111111111111111111111111111111111111111111111111111111111111
GGCACCAATAATTTGCCCATCCACAACAACCGGTACGCCGCCTTCCAGCGACGTTAATTTCGGCGCAGTCACGAACGCGGTACGTCCGTTGTTCACCATC1111111111111111111111111111000000000000000000000000000000001111111111
CAGATCAATGTGCCGGAAGATAACGACATGGTGCTCGCAATGTATGAACGCCTGGGATATGAACACGCCGACGAGCTGAGTCTGGGTAAGCGTTTGATTG1111111001000000000000000000000000001011111111111111111111111100000000
We’ll focus on simple cases: an isolated error generates a’gap’ of k weak kmers.
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 39 / 41
Read correction: task 3Start from solution/minicorrec2.cpp if lost :)
Detect stretches of k non solid kmers.
infer the position of the error.
change the nucleotide at this pos, build a kmer andquery it in the graph, to find which nucleotide is thecorrect one.Hint:
You can get the sequence in char * withseq.getDataBuffer();
Build a Kmer from a char * with model.codeSeed()Build a Node from a kmer with :Node(Node::Value(putative_corrected_kmer.value ()))
You may start form the easier fileminicorrec2_easier.cpp
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 40 / 41
Read correction
Going further:
correction of close snps, snps near begin/end ofread.
validate correction with multiple overlapping kmers
many algorithms / heuristics are possible.
Bloocoo (G.Benoit, G.Rizk, C.Lemaitre) −− >short read correction
Lordec (L. Salmela, E. Rivals) −− > hybrid pacbiocorrection
G.Rizk, R.Chikhi (Genscale, Rennes) GATB workshop 15/06/2016 41 / 41