Genome Comparison via de Bruijn graphs
Student: Ilya Minkin
Advisor: Son Pham
St. Petersburg Academic University
June 4, 2012
1 / 19
Synteny Blocks: Algorithmic challenge
- Suppose that we are given two genomes
- The question is: how are they evolutionarily related to each other?
- To analyze rearrangements, we must decompose the genomes into synteny blocks
- Synteny blocks are evolutionarily conserved segments of the genome
- These blocks cover most of the genome
- They occur in both genomes, with possible variations
2 / 19
Academic Project
Project: Identify synteny blocks for duplicated genomes represented as sequences of nucleotides.
- None of the previous synteny block reconstruction software (including DRIMM-Synteny, Pham and Pevzner 2010) can solve this problem efficiently.
- DRIMM-Synteny can find the synteny blocks for complicated genomes. But:
- It requires the genome to be represented as a sequence of genes.
3 / 19
General Idea: de Bruijn Graph
- We are given an alphabet $\Sigma$ with $|\Sigma| = m$ and a string $S$ over it
- A substring $T$ of $S$ with $|T| = k$ is called a k-mer
- The de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$ = {all possible (k − 1)-mers}
- If a k-mer $T$ occurs in $S$, we add a directed edge $(T[1, k-1], T[2, k])$ to the graph
- Create the de Bruijn graph from the nucleotide sequence
- Conserved regions will yield non-branching paths
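The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `build_de_bruijn` is hypothetical, and only the (k − 1)-mers that actually occur in the string are stored (isolated vertices of the full vertex set are omitted).

```python
from collections import defaultdict

def build_de_bruijn(s, k):
    """Return adjacency lists of the de Bruijn multigraph of s.

    Vertices are (k-1)-mers. Every occurrence of a k-mer in s adds one
    directed edge from its (k-1)-prefix to its (k-1)-suffix, so repeated
    k-mers produce parallel edges.
    """
    graph = defaultdict(list)
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# The 10-nucleotide string from the slides' example, with k = 3.
g = build_de_bruijn("ACCTGTCAGT", 3)
for source, targets in g.items():
    print(source, "->", targets)
```

Storing a list of targets per vertex keeps parallel edges, which matters because the graph is a multigraph.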
4 / 19
Challenges
- Variations in synteny blocks generate cycles, so we need to simplify the graph
- Double-strandedness: conserved regions may occur on both strands. Example:
    5' AACCGGTT 3'
    3' TTGGCCAA 5'
  Such blocks are reverse complementary to each other ⇒ no non-branching paths
- Spurious similarity
- Memory efficiency
5 / 19
Colored graph
- We use colored de Bruijn graphs [Iqbal et al., 2012] to handle double-strandedness
- Suppose that $S^+$ and $S^-$ are the positive and negative strands of the chromosome
- The colored de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$
- For each k-mer $T^+$ in $S^+$, add the edge $(T^+[1, k-1], T^+[2, k])$ to $G_k$ and mark it blue
- For each k-mer $T^-$ in $S^-$, add the edge $(T^-[1, k-1], T^-[2, k])$ to $G_k$ and mark it red
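A small Python sketch of the two-colored construction follows. It assumes that the negative strand is obtained as the reverse complement of the positive strand and is read 5' to 3'; the names `reverse_complement` and `colored_edges` are illustrative and do not come from the slides.

```python
# Translation table for complementing nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Return the opposite strand of s, read 5' to 3'."""
    return s.translate(COMPLEMENT)[::-1]

def colored_edges(s_plus, k):
    """List (source, target, color) edges for both strands.

    k-mers of the positive strand give blue edges; k-mers of the
    negative strand (assumed to be the reverse complement) give red edges.
    """
    edges = []
    strands = (("blue", s_plus), ("red", reverse_complement(s_plus)))
    for color, strand in strands:
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            edges.append((kmer[:-1], kmer[1:], color))
    return edges

edges = colored_edges("ACCTGTCAGT", 3)
print(reverse_complement("ACCTGTCAGT"))
print(len(edges))
```

With both strands in one graph, a region and its reverse-complementary copy share vertices, and the colors keep track of which strand each edge came from.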
6 / 19
Edge labeling
- Note that our graph is built from a string, not from a set of reads
- Each walk in the graph represents a string
- We are interested only in walks that represent substrings of the source string
- Assign to each edge $e$ the label $L(e)$ = position of the corresponding k-mer on the positive strand
- A walk $W = (v_1, e_1, v_2, e_2, \ldots)$ is considered valid iff:
  1. $e_i$ and $e_{i+1}$ are of the same color
  2. $|L(e_i) - L(e_{i+1})| = 1$
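The validity test can be written directly from the two conditions. In this sketch a walk is represented only by its edge sequence, each edge being a hypothetical `(color, label)` pair; the representation and the name `is_valid_walk` are assumptions made for illustration.

```python
def is_valid_walk(edges):
    """Check the two validity conditions on consecutive edges.

    edges is a list of (color, label) pairs in walk order. Every pair of
    consecutive edges must share a color and have labels that differ by
    exactly 1. Walks with fewer than two edges are trivially valid.
    """
    return all(
        c1 == c2 and abs(l1 - l2) == 1
        for (c1, l1), (c2, l2) in zip(edges, edges[1:])
    )

print(is_valid_walk([("blue", 0), ("blue", 1), ("blue", 2)]))
print(is_valid_walk([("blue", 0), ("red", 1)]))
```

Labels may increase or decrease along a walk, which is why the absolute difference is used: edges from the negative strand carry positive-strand positions and are traversed in descending order.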
7 / 19
Example
5' ACCTGTCAGT 3'
3' TGGACAGTCA 5'
Figure 1: Colored de Bruijn graph built from two strands
8 / 19
Graph simplification
- Bulges spoil long non-branching paths and indicate indels/mismatches
- A pair of walks $(W_1, W_2)$ is a bulge iff:
  1) The start and end vertices of $W_1$ and $W_2$ coincide
  2) $W_1$ and $W_2$ have exactly 2 common vertices
  3) There are no edges $u \in W_1$ and $v \in W_2$ such that $L(u) = L(v)$
  4) $|W_1| \le \delta$ and $|W_2| \le \delta$
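The four conditions translate into a short predicate. This is only a sketch under stated assumptions: each walk is given as a hypothetical `(vertices, labels)` pair, where `labels` holds the edge labels in order, and the length |W| is taken to be the number of edges.

```python
def is_bulge(w1, w2, delta):
    """Test the four bulge conditions for a pair of walks.

    Each walk is (vertices, labels): the visited vertices in order and
    the labels of the edges between them (one fewer than the vertices).
    """
    v1, l1 = w1
    v2, l2 = w2
    same_ends = v1[0] == v2[0] and v1[-1] == v2[-1]       # condition 1
    two_common = len(set(v1) & set(v2)) == 2              # condition 2
    disjoint_labels = not (set(l1) & set(l2))             # condition 3
    short = len(l1) <= delta and len(l2) <= delta         # condition 4
    return same_ends and two_common and disjoint_labels and short

upper = (["U", "X", "V"], [3, 4])
lower = (["U", "Y", "V"], [10, 11])
print(is_bulge(upper, lower, delta=5))
```

Condition 3 prevents a stretch of the genome from being treated as a bulge against itself, since two branches that share an edge label would come from the same position.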
Figure 2: A bulge (between vertices U and V)
9 / 19
General pipeline
- Build the de Bruijn graph from the genome
- Remove bulges (BFS-like algorithm)
- Bulges are removed by replacing long branches with shorter ones
- Output non-branching paths
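The last pipeline step, reporting non-branching paths, can be illustrated with a standard traversal. The sketch below is a generic textbook procedure, not necessarily the one used in this project; the name `non_branching_paths` is hypothetical, and isolated cycles are deliberately not reported.

```python
from collections import defaultdict

def non_branching_paths(edges):
    """Return maximal non-branching paths of a directed multigraph.

    edges is a list of (source, target) pairs. A vertex is "inner" when
    it has exactly one incoming and one outgoing edge; paths start at
    non-inner vertices and are extended through inner ones.
    """
    out = defaultdict(list)
    indeg = defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1

    def is_inner(v):
        return indeg[v] == 1 and len(out[v]) == 1

    paths = []
    for u in list(out):  # snapshot: lookups below may add empty entries
        if is_inner(u):
            continue
        for v in out[u]:
            path = [u, v]
            while is_inner(v):
                v = out[v][0]
                path.append(v)
            paths.append(path)
    return paths

demo = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
print(non_branching_paths(demo))
```

After bulge removal, long paths of this kind are the candidates for synteny blocks.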
Figure 3: Bulge removal illustration
10 / 19
Parameter selection
- How should we choose K and δ?
- Duplicated genes may share no long (K > 50) K-mers
- Large K (∼ 50): we find only a few synteny blocks
- Small K (∼ 10) and small δ (∼ 15): we find only very short synteny blocks
- Small K (∼ 10) and large δ (∼ 200): the genome will be disrupted completely
- Solution: perform the simplification in multiple stages
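The effect of K on diverged copies can be seen on a toy example. The sketch below uses two made-up 10-nucleotide strings that differ by a single mismatch; the function name `shared_kmers` and the sequences are illustrative only.

```python
def kmers(s, k):
    """Return the set of distinct k-mers of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def shared_kmers(a, b, k):
    """Return the k-mers that occur in both a and b."""
    return kmers(a, k) & kmers(b, k)

# Two toy copies of a region that differ by one mismatch (position 5).
copy1 = "ACGTACGTAC"
copy2 = "ACGTTCGTAC"

print(sorted(shared_kmers(copy1, copy2, 4)))  # short k-mers still match
print(sorted(shared_kmers(copy1, copy2, 6)))  # longer k-mers all span the mismatch
```

Here the copies share k-mers for k = 4 but none for k = 6, which mirrors the observation that a large K finds few blocks when the duplicated copies have diverged.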
11 / 19
New pipeline
- General idea: "align" similar regions first, then glue them together into synteny blocks
- Start with small K and small δ to smooth duplicated regions and obtain long shared K-mers
- Rebuild and simplify the graph with larger K and δ
- Repeat this process several times
- The final step can use K of several hundred
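The staged process can be pictured as a simple driver loop. This is a sketch under assumptions: it supposes that each stage takes the current sequence and returns a smoothed one, the names `run_stages` and `simplify` are hypothetical, and the stand-in `identity_simplify` only records its parameters. The (K, δ) pairs are the ones used in the Arabidopsis experiment in these slides.

```python
def run_stages(sequence, stages, simplify):
    """Apply simplify(sequence, k, delta) once per stage, in order.

    The output of each stage is the input of the next, so regions
    smoothed at small K can share longer K-mers at later stages.
    """
    for k, delta in stages:
        sequence = simplify(sequence, k, delta)
    return sequence

# (K, delta) per stage, as used in the Arabidopsis experiment.
STAGES = [(15, 150), (50, 500), (100, 1000), (500, 5000)]

def identity_simplify(sequence, k, delta):
    """Placeholder for a real stage: build graph, remove bulges, rewrite."""
    print(f"stage with K={k}, delta={delta}")
    return sequence

result = run_stages("ACCTGTCAGT", STAGES, identity_simplify)
```

Passing the stage as a callback keeps the schedule separate from the graph code, so different parameter schedules can be tried without touching the simplification itself.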
12 / 19
Experiment
- We attempted to identify duplications in Arabidopsis thaliana
- Arabidopsis is known to have a highly duplicated genome [Arabidopsis Genome Initiative]
- The genome size is ∼ 120 Mbp
- We used 4 stages with the following parameters:

Stage number   K     δ
1              15    150
2              50    500
3              100   1000
4              500   5000
13 / 19
Computation results
- We found 4722 synteny blocks in Arabidopsis
- These blocks cover 28% of the genome
- The minimum block length is 1000 bp
- The largest block found has length ∼ 95 000 bp
- We tried to verify the blocks by aligning instances of the same block
- At least 87% of the blocks have 50% exact matches
14 / 19
Computation results
Figure 4: Percentage of matches vs. number of blocks
15 / 19
Computation results
Figure 5: Synteny block length distribution
16 / 19
Summary
- We have covered 28% of the Arabidopsis genome with synteny blocks
- However, we missed some duplicated regions described in [Arabidopsis Genome Initiative]
- Most of the blocks are short (< 5000 bp)
- We must improve coverage and "lengthen" the blocks
Near-term plans:
- Improve performance
- Examine other genomes
- Optimize the algorithms to handle larger genomes
17 / 19
Applications
- A multiple sequence alignment program
- Finding synteny blocks for complicated (mammalian) genomes; possible collaboration: Jian Ma
- A tool for genome vs. genome, assembly vs. assembly, and assembly vs. genome comparisons
18 / 19
References
1. Pevzner P, Tesler G (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution.
2. Pham S, Pevzner P (2010) DRIMM-Synteny: Decomposing Genomes into Evolutionary Conserved Segments.
3. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs.
4. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.
19 / 19
Thank you!