Report outline - WDCM 20… · Dynamic programming for alignment for two string, how many...
Report outline
Bioinformatics & Algorithms
WDCM platform
Short description
Alignment
Assembly
Pattern recognition
Description of trimming process
Quality control: fastQC and trimmomatic
remove consecutive N
remove low quality base pairs
remove adapter
remove duplication
remove sequencing error
remove short reads
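A minimal sketch of three of the trimming steps above (low-quality bases, consecutive N, short reads). This is not how fastQC or trimmomatic are implemented; the function name `trim_read`, the thresholds `min_q` and `min_len`, and the Phred+33 quality encoding are assumptions chosen for illustration.

```python
def trim_read(seq, qual, min_q=20, min_len=5, offset=33):
    """Trim one read; return (seq, qual) or None if the read is discarded.

    All parameter names and default values are hypothetical.
    """
    # Decode the quality string (assumed Phred+33) into integer scores.
    scores = [ord(c) - offset for c in qual]

    # Remove low-quality bases and N from the 3' end.
    end = len(seq)
    while end > 0 and (scores[end - 1] < min_q or seq[end - 1] == "N"):
        end -= 1
    seq, qual = seq[:end], qual[:end]

    # Discard reads that still contain consecutive N or are too short.
    if "NN" in seq or len(seq) < min_len:
        return None
    return seq, qual


print(trim_read("ACGTACGTNN", "IIIIIIII##"))
print(trim_read("ACNNGTACGT", "IIIIIIIIII"))
```

Adapter removal and duplicate removal are left out here because they need extra inputs (the adapter sequence, the whole read set).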
Description of alignment and mapping process
Comparative genomics: alignment and mapping
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
short reads – XXX: BWA, Bowtie, SOAPaligner
long sequence – long sequence: MUMmer
long sequence – database: BLAST family, BLAT, diamond…
Query seq Ref seq
Ref database
Description of alignment and mapping process
Dynamic programming for alignment
How to evaluate the alignment result?
Score the alignment: gap (-1); mismatch (-1); match (+1)
Compare ATAACAT and AGACAT
There are thousands of possible alignment patterns. It is impossible to test them all. => Dynamic programming
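To put a number on "thousands", the sketch below counts the possible gapped alignments of two strings by their lengths. It assumes that every alignment column is a match/mismatch, a gap in the first string, or a gap in the second, and that alignments differing only in the order of adjacent gaps count as different. The function name `count_alignments` is made up for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of alignments of a length-m string with a length-n string."""
    if m == 0 or n == 0:
        # Only gaps remain, so there is exactly one way to finish.
        return 1
    # Last column: gap in string 2, gap in string 1, or an aligned pair.
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))


# ATAACAT has 7 bases and AGACAT has 6.
print(count_alignments(7, 6))
```

Under these counting assumptions the two example strings already allow 19825 alignments, and the count grows exponentially with sequence length.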
Description of alignment and mapping process
Dynamic programming for alignment
1 – 2 – 6: 2+4=6
1 – 3 – 6: 4+2=6
1 – 4 – 6: 3+1=4
The best way is 1-4-6; the shortest path length is 4.
So when we compare the 6 routes, we don't need to recalculate the former part: the best way to node 6 is already known to be 1-4-6.
1 – 2 – 6 – 8 – 10
1 – 3 – 6 – 8 – 10
1 – 4 – 6 – 8 – 10: 4+6+3=13
1 – 2 – 6 – 9 – 10
1 – 3 – 6 – 9 – 10
1 – 4 – 6 – 9 – 10: 4+3+4=11
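The route example can be sketched as a small dynamic program that stores only the best cost to each node and reuses it. The edge weights are an assumption read off the sums on the slide (for example 1→2 = 2 and 2→6 = 4 from "2+4=6"). The function name `shortest_path` and the node order are illustrative.

```python
def shortest_path(edges, order, source, target):
    """Best (cost, path) from source to target in a directed acyclic graph.

    `order` must list the nodes so that every edge goes forward in the list.
    """
    best = {source: (0, [source])}
    for node in order:
        if node not in best:
            continue
        cost, path = best[node]
        for nxt, weight in edges.get(node, {}).items():
            candidate = cost + weight
            # Keep only the cheapest known way to reach each node.
            if nxt not in best or candidate < best[nxt][0]:
                best[nxt] = (candidate, path + [nxt])
    return best[target]


# Edge weights inferred from the sums on the slide (an assumption).
edges = {
    1: {2: 2, 3: 4, 4: 3},
    2: {6: 4},
    3: {6: 2},
    4: {6: 1},
    6: {8: 6, 9: 3},
    8: {10: 3},
    9: {10: 4},
}
print(shortest_path(edges, [1, 2, 3, 4, 6, 8, 9, 10], 1, 10))
```

Each edge is examined once, so the six full routes are never enumerated explicitly.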
Description of alignment and mapping process
Dynamic programming for alignment
How to evaluate the alignment result?
Score the alignment: gap (-1); mismatch (-1); match (+1)
    A  T  A  A  C  A  T
A   1  0  1  1  0  1  0
G   0  0
A   1
C   0
A   1
T   0
     A   T
A    1  -1
G   -1   0
R: AT   T_   AT
Q: G_   AG   AG
Max(-1-1, -1-1, 1-1) = 0
    A  T  A  A  C  A  T
A   1  0  1  1  0  1  0
G   0  0  0  0  0  0  0
A   1  0  1  1  0  1  0
C   0  0  0  0  2  1  0
A   1  0  1  1  1  3  2
T   0  2  1  0  0  2  4
A  T  A  A  C  A  T
A  G  A  _  C  A  T
1 -1  1     1  1  1
local alignment: Smith–Waterman algorithm
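A minimal sketch of the Smith–Waterman matrix fill, using the slide's scores (gap -1, mismatch -1, match +1) and the usual floor of 0 for local alignment. The function name `smith_waterman_matrix` is illustrative. The extra first row and column of zeros is the standard boundary and is not drawn on the slide.

```python
def smith_waterman_matrix(ref, query, match=1, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix (rows: query, columns: ref)."""
    rows, cols = len(query), len(ref)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align the two bases
                H[i - 1][j] + gap,    # gap in the reference
                H[i][j - 1] + gap,    # gap in the query
            )
    return H


H = smith_waterman_matrix("ATAACAT", "AGACAT")
for base, row in zip("AGACAT", H[1:]):
    print(base, *row[1:])
print("best local score:", max(max(row) for row in H))
```

Without the zero boundary, the printed rows reproduce the filled matrix for ATAACAT and AGACAT, and the best local score is the 4 in the bottom-right cell.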
Description of alignment and mapping process
Dynamic programming for alignment
For two strings, how many calculations do we need?
Seed match method: substring searching (exact searching) is much easier.
The reference is much longer than the query (len1 >> len2)
1. Find the seed position (index the reference: less time)
2. Align the latter part (len1×len2 => len2×len2)
Burrows–Wheeler Transform (BWA/Bowtie)
GGTGCTGCTGGGTTTGTGGCTTTACGCGCGAACCCAGGGCGAGAAAGGACTGGACAAGCTGGTATGAAACGCTGG
GGTGCTGCTGGGTTTGTGGCTTTACGXXXXXXXXXXXXXXX
Seed length = 9
search range
mismatch range
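A minimal sketch of the seed match idea: index every k-mer of the reference once, look up the query's first k bases exactly, then check only the region behind each seed hit. A plain hash index stands in for the Burrows–Wheeler index that BWA and Bowtie use. The function names, the `max_mismatches` threshold and the example query are assumptions; only the reference sequence comes from the slide.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(reference, query, k=9, max_mismatches=2):
    """Return (position, mismatches) for each seed hit that extends well."""
    index = build_index(reference, k)
    hits = []
    for pos in index.get(query[:k], []):
        window = reference[pos:pos + len(query)]
        if len(window) < len(query):
            continue  # the query would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(window, query))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = ("GGTGCTGCTGGGTTTGTGGCTTTACGCGCGAACCCAGGGCGAGAAAGGAC"
             "TGGACAAGCTGGTATGAAACGCTGG")
# Hypothetical query: 24 bases copied from the reference with one base changed.
exact = reference[26:50]
query = exact[:15] + ("T" if exact[15] != "T" else "A") + exact[16:]
print(seed_and_extend(reference, query))
```

This sketch extends without gaps; a real mapper would run the dynamic programming step on the extension window instead of counting mismatches.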
Description of alignment and mapping process
Comparative genomics: alignment and mapping
MUMmer long sequence – long sequence
Assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcript (ESTs).
Reference-based assembly
De novo assembly
Why do we need the genome assembly process?
Increase sequence specificity
Extremely difficult tasks! Why?
Assembly
CGCGAACCCAGGGCGAGAAAGGAC is a superstring of
CGCGAACCCCGCGAACCCAGGGCGAGCGCGAACCCAGGGCGAGAAAGGAC
But CGCGAACCCAGGGCGAGGGCGAGAAAGGAC is also a superstring of them.
The shortest common superstring problem (SCSP) is a famous NP-complete (NP-C) problem.
In computational complexity theory, an NP-complete decision problem is one belonging to both the NP and the NP-hard complexity classes. In this context, NP stands for "nondeterministic polynomial time". The set of NP-complete problems is often denoted by NP-C or NPC.
Garey M R, Johnson D S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, 1979.
Gallant J, Maier D, Storer J A. On finding minimal length superstrings. Journal of Computer and System Sciences, 1980, 20(1): 50-58.
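A brute-force sketch of the shortest common superstring problem: try every order of the reads, merge neighbours with their largest overlap, and keep the shortest result. The three reads are hypothetical (the individual fragments cannot be recovered from the slide text) and were chosen so that both superstrings quoted above contain them. The sketch also assumes that no read is contained inside another read.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Merge reads left to right, collapsing the largest overlaps."""
    merged = reads[0]
    for read in reads[1:]:
        merged += read[overlap(merged, read):]
    return merged


def shortest_superstring(reads):
    """Exhaustive search over all n! read orders."""
    return min((merge_in_order(p) for p in permutations(reads)), key=len)


# Hypothetical reads, not taken from the slide.
reads = ["CGCGAACCC", "CCCAGGGCGAG", "GGCGAGAAAGGAC"]
best = shortest_superstring(reads)
print(best, len(best))
```

The search visits n! orders, so it is only usable for a handful of reads; that growth is the practical face of the NP-completeness result.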
Assembly
genome assembly vs. SCSP
reverse complement
sequencing error
repetitive region
So genome assembly is an even HARDER task!
Nagarajan N, Pop M. Sequence assembly demystified. Nature Reviews Genetics, 2013, 14(3).
Assembly
1. Greedy: always merge the reads with the largest overlap
2. Overlap-Layout-Consensus
1st, find the overlaps pairwise
2nd, reads => vertices; overlaps => edges; create the layout graph
3rd, find the best way to traverse all the nodes
NP-hard problem
number of reads increases => calculation time increases
read length decreases => reliability of overlaps decreases
3. de Bruijn graph
1st, cut the reads into k-mers
2nd, (k-1)-mer => vertex; k-mer => edge; create the layout graph (adjacent vertices overlap by k-2 bases)
3rd, find the best way to traverse all the edges
No need to calculate the overlaps; it's easy to trace (by hashing).
Easy to find the best way (by breaking at the forks & Ukkonen's condition).
TGGCATTGCAATTGAC (shown 5 times)
TGGCATTGCAATTGAC
Reads: TGGCA, GCATTGCAA, TGCAAT, CAATT, ATTGAC
Vertices ((k-1)-mers): TGG GGC GCA CAA AAT ATT CAT TTG TGA GAC TTT TGC
Edges (k-mers): TGGC GGCA GCAT CATT ATTG TTGA TGAC TTGC TGCA CAAT GCAA AATT ATTT TTTG
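A minimal sketch of the de Bruijn graph construction, using the reads from the figure (TGGCA, GCATTGCAA, TGCAAT, CAATT, ATTGAC) with k = 4, so that 3-mers are vertices and 4-mers are edges. The function name `de_bruijn` and the set-based adjacency dictionary are choices made for this example, not how any particular assembler stores the graph.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer vertices, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].add(suffix)  # hashing replaces overlap search
            graph[suffix]              # make sure sink vertices exist too
    return graph


reads = ["TGGCA", "GCATTGCAA", "TGCAAT", "CAATT", "ATTGAC"]
graph = de_bruijn(reads, 4)
forks = sorted(v for v, nxt in graph.items() if len(nxt) > 1)
print(len(graph), "vertices,", sum(len(n) for n in graph.values()), "edges")
print("forks:", forks)
```

The two forks (GCA and TTG) come from the repeated ATTG in TGGCATTGCAATTGAC; they are where a contig would be broken.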
Assembly
Assembly errors
TGGCAGCATTGCAATTTGAC
TGG GGC GCA
CAA AAT
ATT CAT TTG TGA GAC
TGGC GGCA GCAT CATT ATTG TTGA TGAC
CAAT GCAA AATT
TGGCACATTGCATTTTTGAC
TGG GGC GCA ATT CAT TTG TGA GAC
TTT
TGGC GGCA GCAT CATT ATTG TTGA TGAC
TGG GGC GCA ATT CAT TTG TGA GAC
TGC
TGGC GGCA GCAT CATT ATTG TTGA TGAC
TTGC TGCA
TGGCATTGCATTGACGCATTG
GCATTG
-M <int> mergeLevel(min 0, max 3): the strength of merging similar sequences during contiging, [1]
-E (optional) merge clean bubble before iterate, works only if -M is set when using multi-kmer, [NO]
-c <float> minContigCvg: minimum contig coverage (c*avgCvg), contigs shorter than 100bp with coverage smaller than c*avgCvg will be masked before scaffolding unless -u is set, [0.1]
-B <float> bubbleCoverage: remove contig with lower coverage in bubble structure if both contigs' coverage are smaller than bubbleCoverage*avgCvg, [0.6]
200
170 (variation)
17 (sequencing error)
Paired-end sequencing: library preparation
random breaking
electrophoresis (markers: 100, 250, 500)
recycling
add adapter
paired-end sequencing
insert-length
[Figure: insert-length histograms for BGI_CGMCC1.12709, BGI_KCTC23076 and BGI_KCTC23430; x axis 100 to 400, y axis 0 to 120000]
Scaffolding
Reads
Contigs
Scaffolds
kmer
hash table
DBG
paired reads
repetitive contig
Insert length
[Figure: contigs 1 to 8 placed in order by linear graph arrangement]
Misassembly & SV detection
Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and genotyping. Nature Reviews Genetics 2011
Pattern recognition
Classification: (how to separate people from cats, how to separate cats from dogs, how to separate huskies from wolves)
manual feature setting
Semi-auto feature setting
Data-based feature setting
E.g.:
a sample: BQ
Group1: AQ, QQ, DQ, 6Q, PQ
Group2: TX, VN, NM, KL
a sample: BQ
Group1: AQ, QQ, DQ, OQ, PQ, BF
Group2: TX, VN, NM, KL, Q$, BB
a sample: BQ
Group1: GB, BF, AA, OD, XB, 69, RG, PO, Q4 ...
Group2: TX, VN, NM, KL, Q$, BB, 88, BD, ...
Pattern recognition
Algorithm classification: (how to separate people from cats, how to separate cats from dogs, how to separate huskies from wolves)
manual feature setting
Semi-auto feature setting
Data-based feature setting
Data | Category
Machine learning
reshape / feature extraction
Discriminant Model
New data
Category of new data
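The flow above (data and categories, feature extraction, discriminant model, category of new data) can be sketched with the first toy example, where every Group1 item ends in Q. The feature here is set manually as the second character, which is an assumption made for illustration. The "model" is just a majority vote per feature value, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def extract_feature(sample):
    """Manual feature setting: use the second character (an assumption)."""
    return sample[1]


def train(labelled):
    """Learn the most common category for each feature value."""
    votes = defaultdict(Counter)
    for sample, category in labelled:
        votes[extract_feature(sample)][category] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in votes.items()}


def predict(model, sample, default="unknown"):
    """Category of new data; falls back to `default` for unseen features."""
    return model.get(extract_feature(sample), default)


training = ([(s, "Group1") for s in ["AQ", "QQ", "DQ", "6Q", "PQ"]]
            + [(s, "Group2") for s in ["TX", "VN", "NM", "KL"]])
model = train(training)
print(predict(model, "BQ"))
```

With the second and third example sets this hand-picked feature stops separating the groups cleanly (BF, Q$), which is the motivation for semi-automatic and data-based feature setting.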