SynaMer A New Application for Rapid Identification of Overlapping n-mers From Sequence Reads

Copyright © 2004 Synamatix sdn bhd (538481-U). June 2006

description

SynaMer: A New Application for Rapid Identification of Overlapping n-mers From Sequence Reads, June 2006. Synamatix team introductions: Colin Hercus, CTO; Poh Yang Ming, Bioinformatics Research Team Member; Arif Anwar, VP. Summary of agenda: overview of genome assembly; key bottlenecks.

Transcript of SynaMer A New Application for Rapid Identification of Overlapping n-mers From Sequence Reads

Page 1: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Copyright © 2004 Synamatix sdn bhd (538481-U)

SynaMer

A New Application for Rapid Identification of Overlapping n-mers From Sequence Reads

June 2006

Page 2: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

www.MGRC.com.my Copyright © 2006 Synamatix Sdn. Bhd. (538481-U)

Synamatix team - Introductions

Colin Hercus, CTO

Poh Yang Ming, Bioinformatics Research Team Member

Arif Anwar, VP

Page 3: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Summary of Agenda

Overview of Genome assembly

Key bottlenecks

Introducing SynaMer: A solution for rapidly finding longer overlapping n-mers

The method

The results

Discussion

Page 4: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

2 major approaches

Ab initio genome assembly

Overlap-layout-consensus

Needs high sequence coverage

No requirement for closely related genome

Comparative genome assembly

Alignment-layout-consensus

Requires a closely related genome

High-speed mapping of sequence reads to the genome is required

Less dependent on overlap finding

Page 5: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Identified bottlenecks for Ab initio

Typical genome assembly process flow: Sequence reads/Fragments -> Vector trimming -> Overlapping -> Contig/Supercontig/Scaffold generation -> Finishing -> Final Genome

User* identified major bottleneck in n-mer finding:

Performance

Preference for longer n-mers

IT Hardware requirements

User* - Major US Genome Research Institute

Page 6: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Task to accomplish

Original user data set and requirement was:

To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp

Report n-mers that have a frequency >2 and <m

Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disk space to find all 100-mer overlaps

Hence the standard approach limits usage to 32-mers

Longer mers help bridge repetitive and low-complexity regions
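As a rough sense of scale for this task, the arithmetic can be sketched as below. It assumes every read is exactly 1,000 bp (the slide says only "1kb"), and the two storage figures are illustrative assumptions about writing every n-mer occurrence out in full, not measurements from the user's pipeline.

```python
# Back-of-envelope sizing for the user's task.
READS = 50_000_000
READ_LEN = 1_000   # assumption: the slide says "1kb"
N = 100

total_bases = READS * READ_LEN
windows_per_read = READ_LEN - N + 1           # n-mer start positions in one read
total_windows = READS * windows_per_read

# Hypothetical storage cost if every n-mer occurrence were stored in full.
bytes_as_text = total_windows * N             # one byte per base
bytes_packed = total_windows * (N * 2 // 8)   # two bits per base

print(f"{total_bases:,} bp in total")
print(f"{total_windows:,} n-mer positions")
print(f"{bytes_as_text / 1e12:.1f} TB as text, {bytes_packed / 1e12:.1f} TB packed")
```

Under these assumptions there are about 45 billion 100-mer positions, so a method that materialises them all would plausibly need disk space on the order of the 1.5 TB the user reported.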

Page 7: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Long v Short n-mers: advantages and disadvantages

100-mer

+ve: Fewer false positives; improvement in final assembly

-ve: Errors in reads may lead to false negatives; slow to process with conventional software

Page 8: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Explanation of advantages

Low-complexity region

A shorter overlap results in more false positives

A longer overlap results in fewer false positives

Final assembly improved

[Figure: panels A and B]
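The effect of overlap length in a low-complexity region can be illustrated with a toy case. This is a sketch only: the two reads and the 10 bp low-complexity stretch are invented for illustration, and the helper names `kmers` and `share_kmer` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_kmer(a, b, k):
    """True if reads a and b have at least one exact k-mer in common."""
    return bool(kmers(a, k) & kmers(b, k))


# Toy reads from two different (hypothetical) loci that both contain the
# same 10 bp low-complexity stretch but have unrelated flanking sequence.
LOW_COMPLEXITY = "ATATATATAT"
read_a = "GCCGTTAGCC" + LOW_COMPLEXITY + "CCGTAAGTCG"
read_b = "TTGACCGGAG" + LOW_COMPLEXITY + "GGCATTCAGA"

for k in (8, 12):
    print(k, share_kmer(read_a, read_b, k))
```

With k = 8 the reads appear to overlap because an 8-mer fits entirely inside the shared stretch. With k = 12 every window must include flanking sequence, so the spurious match disappears.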

Page 9: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

SynaMer: A solution for rapid identification of longer n-mers

SynaMer finds overlapping sequences of a defined length “n” whose frequency of occurrence in the sequence set falls within a given range

It is similar to a class of tools in genome assembly called “overlappers”

Two well-known overlappers are:

UMD Overlapper: Roberts M et al. (2004) Bioinformatics 20(18):3363-3369

KI Overlapper: Tammi MT et al. (2003) NAR 31(15):4663-4672

Page 10: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

How SynaMer works

Given a mer length of “n”:

Extract an n-mer at each position within a read

Compare the n-mer with its reverse complement, to report palindromes

Index n-mers and their locations within reads

For each n-mer within a user-defined frequency range, report the n-mer and its locations
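The steps on this slide can be sketched in a few lines of Python. This is not SynaMer's implementation, which the slides do not show. It is a minimal in-memory illustration that would not scale to millions of reads. The function names are hypothetical, and filing an n-mer and its reverse complement under one canonical key is an assumption.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (other characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_overlapping_nmers(reads, n, min_freq=2, max_freq=50):
    """Yield (n-mer, is_palindrome, locations) for n-mers within the frequency range.

    locations is a list of (read_number, position, is_reverse) tuples, where
    read_number is the 1-based ordinal of the read in the input.
    """
    index = defaultdict(list)
    for read_number, read in enumerate(reads, start=1):
        for pos in range(len(read) - n + 1):
            nmer = read[pos:pos + n]
            rc = reverse_complement(nmer)
            # Assumption: both strands are indexed under one canonical key.
            canonical = min(nmer, rc)
            index[canonical].append((read_number, pos, nmer != canonical))
    for nmer, locations in index.items():
        if min_freq <= len(locations) <= max_freq:
            yield nmer, nmer == reverse_complement(nmer), locations


if __name__ == "__main__":
    for nmer, palindrome, locations in find_overlapping_nmers(
        ["ACGTTGCA", "GTTGCAAA"], n=6
    ):
        print(nmer, len(locations), palindrome, locations)
```

A dictionary keeps the sketch short. For tens of millions of reads an index of this kind has to spill to disk, which is presumably why the tool takes memory and temporary-file parameters.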

Page 11: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

SynaMer data

Input:

Text file of the reads

Parameters:

n (default 96, maximum of 128)

Frequency range (default of 2 to 50)

Memory usage (defaults to available memory)

Temporary file location

Output format:

Text or binary

n-mer Frequency Palindrome direction read ID:location
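The parameters on this slide could be captured and validated as follows. This is a minimal sketch: the class name `SynaMerParams` and its field names are hypothetical, since the slides do not show SynaMer's actual command line or API. Only the defaults and the maximum n come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

MAX_N = 128  # maximum n stated on the slide


@dataclass(frozen=True)
class SynaMerParams:
    """Hypothetical container for the parameters listed on the slide."""

    n: int = 96
    min_freq: int = 2
    max_freq: int = 50
    memory_bytes: Optional[int] = None  # None: use whatever memory is available
    temp_dir: Optional[str] = None      # None: leave the location to the tool

    def __post_init__(self):
        if not 1 <= self.n <= MAX_N:
            raise ValueError(f"n must be between 1 and {MAX_N}, got {self.n}")
        if not 1 <= self.min_freq <= self.max_freq:
            raise ValueError("frequency range must satisfy 1 <= min_freq <= max_freq")


if __name__ == "__main__":
    print(SynaMerParams())
    print(SynaMerParams(n=120, max_freq=120))
```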

Page 12: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Test cases

Use case 1: finding exact 96-mers in 30 million 1 kb reads took approximately 5 hrs to process, with less than 200 GB of temporary disk space, on a dual-CPU Itanium

Compared to 500 hrs and over 1.5 TB of disk space

Use case 2: Brucella_suis 1330, 36080 900 bp reads (http://www.tigr.org/tdb/benchmark/)

Tests were conducted over a range of n-mer lengths, with frequency from a minimum of 2 to 120

n-mer range of: 12, 24, 36, 48, 60, 72, 84, 96, 108, 120

Average execution time measured with 6 replicates

Page 13: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Brucella suis - results

The majority of the patterns are at a frequency of 2-50

More patterns at higher n-mer

A longer n-mer would be more specific and give fewer false positives

[Chart: Brucella suis 1330. Number of overlapping sequences (0 to 3,500,000) against frequency, m (0 to 90), for n = 12, 24, 36 and 96]
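A chart of this kind plots, for each frequency m, how many distinct n-mers occur m times. A tally like that can be sketched as below. The function name `frequency_spectrum` and the toy reads are hypothetical, and for brevity the sketch counts forward-strand matches only.

```python
from collections import Counter


def frequency_spectrum(reads, n):
    """Map frequency m to the number of distinct n-mers occurring exactly m times."""
    occurrences = Counter(
        read[i:i + n] for read in reads for i in range(len(read) - n + 1)
    )
    return Counter(occurrences.values())


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "CGTACGTA"]
    for n in (2, 4):
        print(n, sorted(frequency_spectrum(toy_reads, n).items()))
```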

Page 14: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Distribution of overlapping sequence with frequency

[Chart: Mammalian genome. Number of overlapping sequences (0 to 3.50E+12) against frequency, m (0 to 90), for n = 12, 24, 36 and 96]

Higher level of repeats in more complex genomes leads to increased benefits from using longer n-mers

Page 15: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Using SynaMer there is no time increase with longer n-mers

[Chart: Time vs n-mer (m 2 to 50). Time, s (0 to 25) against n-mer (0 to 140)]

Page 16: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Sample Output

At 96-mer:

TTTCATAAAGCCGCTTTGCACCATAAAGCGCGTCGCCGGTGCTGCCTGTGGTGCCGTAGAAAGTCCAGCCTTCCTCCGCCATCAGGAAATCAACCACTGAAACGGAAA 5 33984:395 25036:255 17186:435 -5741:85 5184:181

TTTCATAAACCTGACCCTGATTCGCCGCACCATCGCCGAAATAGGTCAGCGAAACGGATTTATTCTCACGATAGTGATTGGCGAAGGCCAACCCCGTACCGAGCGAAA 8 30929:163 28279:329 25051:228 -22556:257 -14554:249 -12303:286 15820:325 6770:434

TTTCATAAAACCTAAATAATATAGAATATATTTTTTAATTTACTCCCACAAAAATTGATATTTATAAAATAAAAAATCCCAATCTGTAAATCCCAATAATTTTACAAA 4 32618:184 -9587:456 9891:617 8902:369

TTTCAGTTTCTCAAGCAAACCCTTTATGACATTGCATCTTTGCTGGTGTTTTTCGCCAATGTTGCATTTTGTTTCTCAATTGTAGCGCAAGCAAATGCGGCTTGAAAA 5 26073:487 -21045:262 22952:244 12603:19 6640:383

The numbers before the “:” are the ordinal position of the reads in the file
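Records in this text format can be read back with a small parser. This is a sketch, not a SynaMer API: the names `parse_record`, `Record` and `Location` are hypothetical. Reading a leading "-" on a read number as the opposite direction is an assumption based on the "direction" field in the output format. The sample line is one of the records shown on this slide.

```python
from typing import List, NamedTuple


class Location(NamedTuple):
    read: int       # ordinal position of the read in the input file
    position: int   # location of the n-mer within that read
    reverse: bool   # assumption: a leading "-" marks the opposite direction


class Record(NamedTuple):
    nmer: str
    frequency: int
    locations: List[Location]


def parse_record(line: str) -> Record:
    """Parse one whitespace-separated text record: n-mer, frequency, read:location entries."""
    nmer, frequency, *entries = line.split()
    locations = []
    for entry in entries:
        read, position = entry.split(":")
        locations.append(Location(abs(int(read)), int(position), read.startswith("-")))
    return Record(nmer, int(frequency), locations)


SAMPLE = (
    "TTTCATAAAACCTAAATAATATAGAATATATTTTTTAATTTACTCCCACAAAAATTGATATTTATAAAATAAAAAATCCCAATCTGTAAATCCCAATAATTTTACAAA"
    " 4 32618:184 -9587:456 9891:617 8902:369"
)

if __name__ == "__main__":
    record = parse_record(SAMPLE)
    print(record.frequency, record.locations)
```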

Page 17: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Sample contig

To show validity of the result:

Reads GBUAS15TR and GBUCA37TF

Detected overlap at 96-mer, shown below:

At position 188 on GBUAS15TR and 811 on GBUCA37TF

They can be joined into a 1.5 kbp contig, with consensus
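Joining two reads at a shared n-mer can be sketched as below. The function name `join_at_shared_nmer` and the toy reads are hypothetical. The sketch assumes both reads are already on the same strand and simply concatenates around the shared n-mer, rather than building a consensus as the slide describes. As a plausibility check on the slide's figure: if both reads were about 900 bp, a join at positions 811 and 188 would give 811 + (900 - 188) = 1523 bp, which fits the stated 1.5 kbp.

```python
def join_at_shared_nmer(left, left_pos, right, right_pos, n):
    """Merge two same-strand reads that share an exact n-mer.

    The n-mer starts at left_pos in `left` (near its end) and at right_pos
    in `right` (near its start). No consensus is computed: the result is the
    part of `left` before the n-mer followed by `right` from the n-mer onwards.
    """
    if left[left_pos:left_pos + n] != right[right_pos:right_pos + n]:
        raise ValueError("reads do not share the n-mer at the given positions")
    return left[:left_pos] + right[right_pos:]


if __name__ == "__main__":
    # Toy reads sharing the 6-mer GTAGCA.
    contig = join_at_shared_nmer("TTGACCGTAGCATG", 6, "GTAGCATGCCAAT", 0, 6)
    print(contig, len(contig))
```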

Page 18: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Conclusions

Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with a temporary file size of less than 200 GB

Finding overlapping sequences in 33000 900 bp reads from a bacterial WGSS took less than 20 s

100-fold faster than the conventional method

Allows use of longer n-mers

Potentially increases quality of assembly

SynaMer will be released as a product later this summer

Page 19: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads

Questions and Follow up

Please send questions to: [email protected]

Webcast will be available online in 24hrs at www.mgrc.com.my

Paper accompanying this webcast will be sent to all attendees

If you are interested in testing SynaMer when it is released please email: [email protected]

Page 20: SynaMer A New Application for  Rapid Identification of Overlapping n-mers From Sequence Reads


Thank you!

[email protected]