Report outline - WDCM 20… · Dynamic programming for alignment for two string, how many...


Page 1

Report outline

Bioinformatics & Algorithms

WDCM platform

Short description

Alignment

Assembly

Pattern recognition

Page 2

Description of trimming process

Quality control: FastQC and Trimmomatic

remove consecutive Ns

remove low-quality bases

remove adapters

remove duplicates

remove sequencing errors

remove short reads
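Some of the trimming steps above can be sketched in a few lines of Python. This is only an illustration, not how FastQC or Trimmomatic work internally: the names `trim_read` and `remove_duplicates`, the quality threshold of 20 and the minimum length of 36 are all assumptions chosen for the example, and adapter removal is left out.

```python
def trim_read(seq, quals, min_qual=20, min_len=36):
    """Trim one read; return (seq, quals), or None if the read is discarded.

    seq   -- base string, e.g. "ACGTN"
    quals -- list of Phred quality scores, one per base
    min_qual and min_len are assumed defaults, not values from the slides.
    """
    # Trim low-quality bases and N calls from the 3' end.
    end = len(seq)
    while end > 0 and (quals[end - 1] < min_qual or seq[end - 1] == "N"):
        end -= 1
    # Trim low-quality bases and N calls from the 5' end.
    start = 0
    while start < end and (quals[start] < min_qual or seq[start] == "N"):
        start += 1
    seq, quals = seq[start:end], quals[start:end]
    # Discard reads that are too short or still contain consecutive Ns.
    if len(seq) < min_len or "NN" in seq:
        return None
    return seq, quals


def remove_duplicates(reads):
    """Keep the first copy of each identical sequence, preserving order."""
    return list(dict.fromkeys(reads))
```

A read such as `NNACGTACGT` with low scores at both ends comes back shorter; a read that falls below `min_len` after trimming comes back as `None`.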

Page 3

Description of alignment and mapping process

Comparative genomics: alignment and mapping

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

short reads – XXX: BWA, Bowtie, SOAPaligner
long sequence – long sequence: MUMmer
long sequence – database: BLAST family, BLAT, DIAMOND…

Query seq Ref seq

Ref database

Page 4

Description of alignment and mapping process

Dynamic programming for alignment

How to evaluate the alignment result?

Score the alignment: gap (-1); mismatch (-1); match (+1)

Compare ATAACAT and AGACAT

There are thousands of possible alignments, so it is impossible to test them all. => Dynamic programming
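To put a number on "thousands", the sketch below counts every gapped alignment of two sequences as a path through the alignment grid (each step is a residue pair, a gap in one sequence, or a gap in the other). The function name `count_alignments` is made up for this example, and counting paths this way treats alignments that differ only in gap order as distinct.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of gapped alignments of two sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only gaps are left on one side
    return (count_alignments(m - 1, n - 1)   # align two residues
            + count_alignments(m - 1, n)     # gap in the second sequence
            + count_alignments(m, n - 1))    # gap in the first sequence


total = count_alignments(len("ATAACAT"), len("AGACAT"))
print(total)  # 19825
```

Even for these two short strings there are close to twenty thousand candidates, and the count grows exponentially with length, which is why scoring them one by one is not practical.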

Page 5

Description of alignment and mapping process

Dynamic programming for alignment

1 – 2 – 6: 2+4=6
1 – 3 – 6: 4+2=6
1 – 4 – 6: 3+1=4

The best way to node 6 is 1-4-6; the shortest path costs 4!

So when we compare the 6 routes to node 10, we don't need to recalculate the former part: the best route must start with 1-4-6.

1 – 2 – 6 – 8 – 10
1 – 3 – 6 – 8 – 10
1 – 4 – 6 – 8 – 10: 4+6+3=13
1 – 2 – 6 – 9 – 10
1 – 3 – 6 – 9 – 10
1 – 4 – 6 – 9 – 10: 4+3+4=11
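The same reasoning can be written as a small dynamic program. The edge weights below are read off the sums on this slide (for example "2+4" for 1-2-6 is taken to mean 1->2 costs 2 and 2->6 costs 4); the graph layout and the function name `shortest_path` are assumptions made for the sketch.

```python
# Edge weights inferred from the sums on the slide (an assumption).
EDGES = {
    1: {2: 2, 3: 4, 4: 3},
    2: {6: 4},
    3: {6: 2},
    4: {6: 1},
    6: {8: 6, 9: 3},
    8: {10: 3},
    9: {10: 4},
}


def shortest_path(edges, source, target):
    """Dynamic programming over nodes visited in increasing order.

    Works here because every edge goes from a smaller to a larger node
    number, so a node's best cost is final before its edges are relaxed.
    """
    best = {source: (0, None)}  # node -> (best cost so far, previous node)
    for node in sorted(edges):
        if node not in best:
            continue
        cost_here = best[node][0]
        for nxt, weight in edges[node].items():
            if nxt not in best or cost_here + weight < best[nxt][0]:
                best[nxt] = (cost_here + weight, node)
    # Walk the stored predecessors back from the target.
    path = [target]
    while path[-1] != source:
        path.append(best[path[-1]][1])
    return best[target][0], path[::-1]


cost, path = shortest_path(EDGES, 1, 10)
print(cost, path)  # 11 [1, 4, 6, 9, 10]
```

Each node's best cost is stored once and reused, so the three ways of reaching node 6 are compared a single time instead of once per full route.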

Page 6

Description of alignment and mapping process

Dynamic programming for alignment

How to evaluate the alignment result?

Score the alignment: gap (-1); mismatch (-1); match (+1)

Q\R  A  T  A  A  C  A  T
A    1  0  1  1  0  1  0
G    0  0
A    1
C    0
A    1
T    0

Q\R  A   T
A    1  -1
G   -1   0

R: AT   T_   AT
Q: G_   AG   AG
Max(-1-1, -1-1, 1-1) = 0

Q\R  A  T  A  A  C  A  T
A    1  0  1  1  0  1  0
G    0  0  0  0  0  0  0
A    1  0  1  1  0  1  0
C    0  0  0  0  2  1  0
A    1  0  1  1  1  3  2
T    0  2  1  0  0  2  4

A T A _ A C A T
A G A C A T
1 -1 1 1 1 1

local alignment: Smith–Waterman algorithm
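A minimal Smith–Waterman sketch with the scoring used on this slide (match +1, mismatch -1, gap -1, scores floored at 0). The function name and return layout are choices made for this example; the matrix it fills for ATAACAT (reference, columns) and AGACAT (query, rows) can be compared with the one above.

```python
def smith_waterman(ref, query, match=1, mismatch=-1, gap=-1):
    """Return (score matrix, best score, aligned ref part, aligned query part)."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match or mismatch
                          H[i - 1][j] + gap,     # gap in the reference
                          H[i][j - 1] + gap)     # gap in the query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(ref[j - 1])
            bottom.append(query[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append("_")
            bottom.append(query[i - 1])
            i -= 1
        else:
            top.append(ref[j - 1])
            bottom.append("_")
            j -= 1
    return H, best, "".join(reversed(top)), "".join(reversed(bottom))


matrix, score, ref_part, query_part = smith_waterman("ATAACAT", "AGACAT")
print(score, ref_part, query_part)  # 4 ACAT ACAT
```

The best local alignment is the shared ACAT at the end of both strings, which corresponds to the 4 in the bottom-right corner of the matrix.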

Page 7

Description of alignment and mapping process

Dynamic programming for alignment

For two strings, how many calculations would we do?

Seed match method: substring searching (exact searching) is much easier.

The reference is much longer than the query (len1 >> len2)
1. find the seed position
2. index the reference (less time)
3. find the latter part (len1×len2 => len2×len2)

Burrows–Wheeler transform (BWA/Bowtie)
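For reference, the Burrows–Wheeler transform itself fits in a few lines when written naively by sorting all rotations. This is a teaching sketch only: BWA and Bowtie use far more memory-efficient index structures, and the function names and the `$` terminator are conventions assumed here.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Rebuild the text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("GGTGCTGCTGGG")
assert inverse_bwt(transformed) == "GGTGCTGCTGGG"
```

The transform is reversible and tends to group identical characters together, which is what makes a compact, searchable index of the reference possible.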

GGTGCTGCTGGGTTTGTGGCTTTACGCGCGAACCCAGGGCGAGAAAGGACTGGACAAGCTGGTATGAAACGCTGG

GGTGCTGCTGGGTTTGTGGCTTTACGXXXXXXXXXXXXXXX

Seed length = 9

search range

mismatch range
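The seed match idea can be sketched as follows: index every k-mer of the reference once, look up the first k bases of the query (the seed) exactly, then compare only the rest of the query at each seed hit. The reference string and the seed length of 9 come from this slide; the query, the plain dictionary index, the mismatch limit and the function names are assumptions for the example (BWA and Bowtie use a Burrows–Wheeler index instead of a dictionary).

```python
from collections import defaultdict

REFERENCE = ("GGTGCTGCTGGGTTTGTGGCTTTACGCGCGAACCCAGGGCGAGAAAGGAC"
             "TGGACAAGCTGGTATGAAACGCTGG")
QUERY = "CGCGAACCCAGGGCGAGAAATGAC"  # assumed read with one mismatch


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(reference, query, index, k, max_mismatches=2):
    """Exact lookup of the seed, then count mismatches over the full query."""
    hits = []
    for pos in index.get(query[:k], []):
        window = reference[pos:pos + len(query)]
        if len(window) < len(query):
            continue  # the query would run off the end of the reference
        mismatches = sum(1 for a, b in zip(window, query) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


index = build_kmer_index(REFERENCE, 9)
hits = seed_and_extend(REFERENCE, QUERY, index, 9)
print(hits)
```

Only the positions returned by the seed lookup are compared base by base, so the expensive comparison runs over a query-sized window rather than the whole reference.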

Page 8

Description of alignment and mapping process

Comparative genomics: alignment and mapping

MUMmer long sequence – long sequence

Page 9

Assembly

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

Reference-based assembly

De novo assembly

Why do we need the genome assembly process?

Increase sequence specificity

An extremely difficult task! Why?

Page 10

Assembly

CGCGAACCCAGGGCGAGAAAGGAC is a superstring of
CGCGAACCC
CGCGAACCCAGGGCGAG
CGCGAACCCAGGGCGAGAAAGGAC

But CGCGAACCCAGGGCGAGGGCGAGAAAGGAC is also a superstring of them.

Finding the shortest common superstring (the SCS problem, SCSP) is a famous NP-complete (NP-C) problem.

In computational complexity theory, an NP-complete decision problem is one belonging to both the NP and the NP-hard complexity classes. In this context, NP stands for "nondeterministic polynomial time". The set of NP-complete problems is often denoted by NP-C or NPC.

Garey M R, Johnson D S. Computers and Intractability: A Guide to the Theory of NP-Completeness[M]. W. H. Freeman, New York, 1979, 70.
Gallant J, Maier D, Storer J A. On finding minimal length superstrings[J]. Journal of Computer and System Sciences, 1980, 20(1): 50-58.
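A brute-force sketch shows what "shortest common superstring" means and why it does not scale: it tries every ordering of the reads, so the work grows factorially with the number of reads. The three reads below are assumed for illustration (they are chosen so that both superstrings quoted on this slide contain them), and the function names are made up for the example.

```python
from itertools import permutations

# Hypothetical reads, chosen so that both superstrings on the slide contain them.
READS = ["CGCGAACCC", "AACCCAGGGCGAG", "GGGCGAGAAAGGAC"]


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_in_order(reads):
    """Glue reads left to right, reusing the largest overlap at each join."""
    merged = reads[0]
    for read in reads[1:]:
        if read in merged:
            continue  # already covered
        merged += read[overlap(merged, read):]
    return merged


def shortest_superstring(reads):
    """Try every ordering of the reads: n! candidates for n reads."""
    return min((merge_in_order(order) for order in permutations(reads)), key=len)


best = shortest_superstring(READS)
print(best, len(best))
```

With three reads there are only 6 orderings to try; with a few dozen reads the number of orderings is already astronomically large.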

Page 11

Assembly


genome assembly vs. SCSP
reverse complement
sequencing error
repetitive region

So genome assembly is an even HARDER task!

Nagarajan N, Pop M. Sequence assembly demystified[J]. Nature Reviews Genetics, 2013, 14(3).
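The reverse-complement complication is easy to state in code: a read may come from either strand, so an assembler has to treat a sequence and its reverse complement as the same piece of DNA. The helper names below are made up for this sketch; using the lexicographically smaller of the two forms as a shared key is one common convention, assumed here.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


print(reverse_complement("CGCGAACCC"))  # GGGTTCGCG
```

Plain SCSP has no such ambiguity, which is one reason real assembly is harder than the textbook problem.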

Page 12

Assembly

1. Greedy: find the largest overlap
2. Overlap-Layout-Consensus
   1st, find the overlaps pairwise
   2nd, reads => vertices; overlaps => edges; create the layout graph
   3rd, find the best way to traverse all the nodes
   NP-hard problem
   more reads => more calculation time
   shorter reads => less reliable overlaps
3. de Bruijn graph
   1st, cut the reads into k-mers
   2nd, (k-1)-mer => vertex; k-mer => edge; create the kmer-2 layout graph
   3rd, find the best way to traverse all the edges
   no need to calculate the overlaps; it's easy to trace (by hashing).
   easy to find the best way (by breaking at forks & Ukkonen's condition).
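The greedy strategy from item 1 can be sketched as: compute all pairwise overlaps, merge the pair with the largest one, and repeat. The five reads are the ones shown on this slide for the sequence TGGCATTGCAATTGAC; the function names are made up for the example, and the all-against-all loop is the same pairwise overlap step that makes Overlap-Layout-Consensus slow when the number of reads grows.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(contigs) > 1:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more; keep separate contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


reads = ["TGGCA", "GCATTGCAA", "TGCAAT", "CAATT", "ATTGAC"]
print(greedy_assemble(reads))  # ['TGGCATTGCAATTGAC']
```

Greedy merging works on this small example, but it can take a wrong turn when a repeat makes a false overlap look larger than the true one.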

[Figure: de Bruijn graph example]
Sequence: TGGCATTGCAATTGAC
Reads: TGGCA, GCATTGCAA, TGCAAT, CAATT, ATTGAC
Vertices (3-mers): TGG, GGC, GCA, CAT, ATT, TTG, TGA, GAC, CAA, AAT, TGC, TTT
Edges (4-mers): TGGC, GGCA, GCAT, CATT, ATTG, TTGA, TGAC, TTGC, TGCA, GCAA, CAAT, AATT, ATTT, TTTG
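The de Bruijn approach from item 3 can be sketched with k = 4: every 4-mer becomes an edge between its two 3-mer ends, and the assembly is a path that uses every edge exactly once (an Eulerian path, found here with Hierholzer's algorithm). The sketch assumes error-free k-mers taken directly from the example sequence, with repeated k-mers kept as repeated edges; the function names are made up for the example.

```python
from collections import defaultdict

GENOME = "TGGCATTGCAATTGAC"  # example sequence from this slide


def kmers_of(seq, k):
    """All k-mers of seq, in order, keeping repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(kmers):
    """(k-1)-mer => vertex; k-mer => edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start where out-degree exceeds in-degree, if such a vertex exists.
    start = next((v for v in graph if len(graph[v]) - in_degree[v] == 1),
                 next(iter(graph)))
    remaining = {v: list(targets) for v, targets in graph.items()}
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if remaining.get(vertex):
            stack.append(remaining[vertex].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(build_graph(kmers))
    return path[0] + "".join(vertex[-1] for vertex in path[1:])


assembled = assemble(kmers_of(GENOME, 4))
print(assembled)
```

Because ATTG occurs twice in the example, the graph has two valid Eulerian paths (TGGCATTGCAATTGAC and TGGCAATTGCATTGAC); both use exactly the same 4-mers, which is how repeats make an assembly ambiguous.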


Page 14

Assembly errors

[Figure: de Bruijn graphs (3-mer vertices such as TGG, GGC, GCA; 4-mer edges such as TGGC, GGCA, GCAT) with extra branches through TTT, through TGC (TTGC, TGCA) and through the repeated segment GCATTG]

-M <int> mergeLevel(min 0, max 3): the strength of merging similar sequences during contiging, [1]

-E (optional) merge clean bubble before iterate, works only if -M is set when using multi-kmer, [NO]

-c <float> minContigCvg: minimum contig coverage (c*avgCvg), contigs shorter than 100bp with coverage smaller than c*avgCvg will be masked before scaffolding unless -u is set, [0.1]

-B <float> bubbleCoverage: remove contig with lower coverage in bubble structure if both contigs' coverage are smaller than bubbleCoverage*avgCvg, [0.6]

200

170 (variation)

17 (sequencing error)
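The three numbers above suggest a simple coverage rule for the weaker branch of a bubble. The sketch below assumes that 200 is the coverage of the main path and that 170 and 17 are coverages of alternative branches; the single ratio threshold of 0.1 and the function name are illustrative assumptions and do not reproduce the logic behind the options listed above.

```python
def classify_branch(branch_coverage, main_coverage, error_ratio=0.1):
    """Label the weaker branch of a bubble by its coverage relative to the main path.

    error_ratio is an assumed cut-off chosen for this example.
    """
    if branch_coverage < error_ratio * main_coverage:
        return "sequencing error"
    return "variation"


print(classify_branch(170, 200))  # variation
print(classify_branch(17, 200))   # sequencing error
```

A branch supported almost as well as the main path is more likely a real variant, while a branch seen in only a small fraction of the reads is more likely a sequencing error.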

Page 15

Paired-end sequencing: library preparation

random breaking

electrophoresis (ladder marks: 100, 250, 500)

recycling

add adapter

paired-end sequencing

insert length

[Figure: insert-length distributions for BGI_CGMCC1.12709, BGI_KCTC23076 and BGI_KCTC23430; x-axis from 100 to 400, y-axis from 0 to 120000]

Page 16

Scaffolding

Reads

Contigs

Scaffolds

kmer

hash table

DBG

paired reads repetitive contig

Insert length

[Figure: contigs numbered 1 to 8, linked by paired reads and placed in the order 1 2 3 4 5 6 7 8]

linear graph arrangement
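The ordering step can be sketched as follows, under a strong simplifying assumption: each link (a, b) means that paired reads place contig a immediately before contig b, and the links form one unbranched chain. Real scaffolders also weigh link counts, insert length and repetitive contigs; the function name and input format here are made up for the example.

```python
def order_contigs(links):
    """Order contigs from (left, right) links that form a single chain."""
    successor = dict(links)
    starts = set(successor) - set(successor.values())
    if len(starts) != 1:
        raise ValueError("links do not form a single linear chain")
    order = [starts.pop()]
    seen = set(order)
    while order[-1] in successor:
        nxt = successor[order[-1]]
        if nxt in seen:
            raise ValueError("links contain a cycle")
        order.append(nxt)
        seen.add(nxt)
    return order


links = [(3, 4), (1, 2), (7, 8), (2, 3), (5, 6), (4, 5), (6, 7)]
print(order_contigs(links))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

A repetitive contig would appear with several possible neighbours, which breaks the single-chain assumption; that is the case the slide singles out.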

Page 17

Misassembly & SV detection

Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and genotyping. Nature Reviews Genetics 2011

Page 18

Pattern recognition

Classification: (how to separate people from cats, how to separate cats from dogs, how to separate huskies from wolves)

manual feature setting
semi-auto feature setting
data-based feature setting

E.g.:
a sample: BQ
Group1: AQ, QQ, DQ, 6Q, PQ
Group2: TX, VN, NM, KL

a sample: BQ
Group1: AQ, QQ, DQ, OQ, PQ, BF
Group2: TX, VN, NM, KL, Q$, BB

a sample: BQ
Group1: GB, BF, AA, OD, XB, 69, RG, PO, Q4 ...
Group2: TX, VN, NM, KL, Q$, BB, 88, BD, ...
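Manual feature setting can be illustrated with a hand-written rule. The rule below ("the second character is Q") is an assumed feature that a person might pick by looking at the first example; applying it to the second example shows where such a rule starts to fail.

```python
def manual_rule(sample):
    """Hand-picked feature (an assumption): is the second character 'Q'?"""
    return "Group1" if sample[1] == "Q" else "Group2"


# Groups from the second example on this slide.
GROUP1 = ["AQ", "QQ", "DQ", "OQ", "PQ", "BF"]
GROUP2 = ["TX", "VN", "NM", "KL", "Q$", "BB"]

misclassified = ([s for s in GROUP1 if manual_rule(s) != "Group1"]
                 + [s for s in GROUP2 if manual_rule(s) != "Group2"])

print(manual_rule("BQ"))  # Group1
print(misclassified)      # ['BF']
```

The rule still places the sample BQ in Group1, but it already gets BF wrong, and it says nothing useful about the third example, where no single obvious feature separates the groups.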

Page 19

Pattern recognition

Algorithm classification: (how to separate people from cats, how to separate cats from dogs, how to separate huskies from wolves)

manual feature setting
semi-auto feature setting
data-based feature setting

Data | Category

Machine learning

reshape / feature extraction

Discriminant Model

New data

Category of new data
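The flow on this slide (labelled data, feature extraction, a learned discriminant model, then the category of new data) can be sketched with a very small frequency-based model. Everything here is an assumption made for illustration: the features are (position, character) pairs, the model stores how often each feature occurs in each category, and a new sample gets the category with the highest summed frequency. The training data are the two-character groups from the previous slide.

```python
from collections import Counter


def extract_features(sample):
    """Feature extraction: (position, character) pairs."""
    return list(enumerate(sample))


def train(labelled):
    """labelled: category -> list of samples. Returns per-category feature frequencies."""
    model = {}
    for category, samples in labelled.items():
        counts = Counter(f for s in samples for f in extract_features(s))
        model[category] = {f: c / len(samples) for f, c in counts.items()}
    return model


def predict(model, sample):
    """Pick the category whose training samples share the most features."""
    def score(category):
        return sum(model[category].get(f, 0.0) for f in extract_features(sample))
    return max(model, key=score)


training_data = {
    "Group1": ["AQ", "QQ", "DQ", "OQ", "PQ", "BF"],
    "Group2": ["TX", "VN", "NM", "KL", "Q$", "BB"],
}
model = train(training_data)
print(predict(model, "BQ"))  # Group1
```

Here nobody writes the rule by hand: the weight of "Q in the second position" comes out of the labelled data, which is the point of data-based feature setting.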