Pattern Matching
Rhys Price Jones
Anne R. Haake
What is pattern matching?
• Pattern matching is the procedure of scanning a nucleic acid or protein sequence for matches to short sequence patterns (Staden 1990).
Why search for patterns?
• Usually the sequences of interest (the query sequences) are known to be indicators of some important biological function
• Search for patterns in nucleotide sequence (DNA or RNA)
• Search for patterns in amino acid sequence
Motif
• Multiple uses of the word
• Def: a pattern; typically used to refer to a short (up to ten bases or residues) repeated or conserved pattern in nucleic acids or proteins
• Def: a short conserved sequence in a protein; usually associated with function
– In a broader sense, motif is used for all localized regions of homology, regardless of size
Some examples of patterns in DNA sequence:
• Restriction sites: recognition sites for the restriction endonucleases
• Intron splice sites
• Codons specifying ORFs
• Promoters
• DNA binding sites for regulatory proteins
Restriction Sites
• Why identify them?
• Exact or inexact matches?
• Examples:
Restriction sites
Splice Sites
• Splice donor and splice acceptor
• Are consensus sequences
– A statistical determination of the pattern; approximates the pattern
• C(or A)AG/GTA(or G)AGT "donor" splice site
• T(or C)nNC(or T)AG/G "acceptor" splice site, where T(or C)n is a run of n pyrimidines
• Splice site example
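A consensus such as the donor site can be written with IUPAC ambiguity codes and turned into a regular expression. The following is a minimal sketch, assuming the donor consensus is encoded as MAGGTRAGT (M = A or C, R = A or G); the names IUPAC, consensus_to_regex, and find_sites are illustrative and do not come from the slides. Because a consensus only approximates real sites, a strict match like this one will miss genuine splice sites that deviate from it.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (subset).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus))

def find_sites(pattern, seq):
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile("(?=" + pattern.pattern + ")")
    return [m.start() for m in lookahead.finditer(seq)]

# Assumed encoding of the donor consensus C(or A)AG/GTA(or G)AGT.
DONOR = consensus_to_regex("MAGGTRAGT")

print(find_sites(DONOR, "TTCAGGTAAGTCC"))  # [2]
```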
Splice Sites
• Remember that they are consensus sequences
• Why are splice sites of interest?
– Gene finding
– Mutations in the consensus sequence at the splice junctions are common in many inherited disorders
• Ex: thalassemias, muscular dystrophy, Tay-Sachs, neurofibromatosis, Darier’s disease...
• One of the thalassemias: mutation at splice acceptor
YYYNCAG| normal
YYYNCGG| mutant
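The effect of the A to G change can be checked directly against the acceptor consensus. This is a minimal sketch, assuming YYYNCAG is read as three pyrimidines, any base, then CAG; the two 7-base sequences are made-up illustrations, not real gene sequence, and matches_acceptor is a hypothetical helper name.

```python
import re

# YYYNCAG: Y = C or T, N = any base (assumed encoding of the consensus).
ACCEPTOR = re.compile(r"[CT]{3}[ACGT]CAG")

def matches_acceptor(site):
    """Return True if the 7-base site fits the YYYNCAG acceptor consensus."""
    return ACCEPTOR.fullmatch(site) is not None

normal = "CTTACAG"  # hypothetical normal acceptor site
mutant = "CTTACGG"  # same site with the A -> G point mutation

print(matches_acceptor(normal))  # True
print(matches_acceptor(mutant))  # False
```

A single substitution in the invariant AG is enough to make the site unrecognizable to an exact consensus match.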
Codons Specifying ORFs
• ORFs (open reading frames)
• Start codon ... 60-100 amino acids and no stop codon
• Prokaryotic start codons: usually ATG, GTG or TTG, but species-specific
• Eukaryotic start: ATG
Code table
• More on this, too, when we discuss gene finding
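The ORF criterion (a start codon followed by a long stretch with no stop codon) can be sketched as a scan over the three forward reading frames. This is a minimal illustration, not a gene finder: it assumes the standard stop codons TAA, TAG and TGA, looks at the forward strand only, and uses a hypothetical min_codons threshold of 60 taken from the lower end of the 60-100 range. The function name find_orfs is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, start_codons=("ATG",), min_codons=60):
    """Return (start, end) pairs for forward-strand ORFs.

    start is the index of the start codon; end is one past the stop codon.
    min_codons counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy sequence: ATG, five GCT codons, then TAA, in reading frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))  # [(2, 23)]
```

For a prokaryotic genome the start_codons argument could be widened to ("ATG", "GTG", "TTG"), keeping in mind that usage is species-specific.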
Promoters
• Prokaryotic promoters: consensus sequences
– TTGACA (-35) ---- 17±1 ---- TATAAT (-10)
• Eukaryotic promoters
– TATA box at –25 relative to transcriptional start site
• consensus is 5’-TATAWAW-3’ (W = A or T)
– Initiator sequence (Inr)
• consensus is 5’-YYCARR-3’ (Y is C or T; R is G or A)
• the +1 nucleotide (start) is usually the A of the Inr sequence
• Bind basal transcription factors
– We’ll revisit this when we discuss gene finding
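The prokaryotic consensus (a -35 box, a spacer of 17±1 bases, then a -10 box) lends itself to a small approximate-matching sketch. The code below is an illustration under stated assumptions: it scores each candidate by the total number of mismatches against TTGACA and TATAAT, and the max_mismatches threshold and the function names are hypothetical choices, not part of the slides. Real promoter prediction typically uses weighted models rather than a plain mismatch count.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatches=2,
                   box35="TTGACA", box10="TATAAT", spacers=(16, 17, 18)):
    """Return (pos35, pos10, spacer, mismatches) for candidate promoters."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        d35 = hamming(seq[i:i + len(box35)], box35)
        if d35 > max_mismatches:
            continue
        for gap in spacers:
            j = i + len(box35) + gap
            window = seq[j:j + len(box10)]
            if len(window) < len(box10):
                continue  # -10 box would run off the end of the sequence
            total = d35 + hamming(window, box10)
            if total <= max_mismatches:
                hits.append((i, j, gap, total))
    return hits

toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoters(toy, max_mismatches=0))  # [(2, 25, 17, 0)]
```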
Transcription Factor Binding Sites
• Regulatory transcription factors are sequence-specific DNA-binding proteins; sites are often found in or near gene promoter regions
• DNA sequence is called the response element
• What are the DNA sequences like?
Response elements
Some examples of patterns in protein sequences (motifs):
• Prediction of secondary and tertiary structure
– e.g. transcription factors
• helix-turn-helix, b-zip, zinc-finger
Examples
• Presence of active sites of enzymes
• Presence of cell localization signals
Exact vs Inexact (Approximate) Pattern Matching
• Exact Pattern Matching
– Limited use in bioinformatics
– Well-known algorithms (last week)
– A common use of exact pattern matching is to compare a sequence against a large number of possible known patterns, such as in the identification of restriction sites
• Approximate
– Most of the other examples of pattern matching in bioinformatics
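Scanning one sequence against a table of known restriction sites is a direct use of exact matching. The sketch below assumes a tiny three-enzyme table (the recognition sequences for EcoRI, BamHI and HindIII) and uses Python's built-in substring search rather than any particular algorithm from the course; the names SITES, find_all and scan_restriction_sites are illustrative.

```python
# Small illustrative table; a real scan would load a full enzyme database.
SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

def find_all(seq, pattern):
    """Return every 0-based position where pattern occurs exactly in seq."""
    positions = []
    i = seq.find(pattern)
    while i != -1:
        positions.append(i)
        i = seq.find(pattern, i + 1)  # step by one to allow overlaps
    return positions

def scan_restriction_sites(seq, sites=SITES):
    """Map each enzyme name to the positions of its recognition site."""
    seq = seq.upper()
    return {name: find_all(seq, site) for name, site in sites.items()}

print(scan_restriction_sites("AAGAATTCTTGGATCCGAATTC"))
# {'EcoRI': [2, 16], 'BamHI': [10], 'HindIII': []}
```

These three sites are palindromic, so scanning one strand is sufficient for them; a non-palindromic site would also need a scan of the reverse complement.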
Other uses of exact pattern matching?
• Check PCR primers?
• Annotation? (text matching)
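As one illustration of the primer check, the sketch below tests whether a forward primer and the reverse complement of a reverse primer both occur exactly in a template, and returns the region they would bracket. It is a simplified assumption-laden example: it requires perfect matches, considers only the first occurrence of each primer, and uses made-up sequences; reverse_complement and amplicon are hypothetical helper names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def amplicon(template, forward, reverse):
    """Return the template region bracketed by the two primers, or None.

    The forward primer must match the template strand exactly; the reverse
    primer must match the opposite strand, so its reverse complement is
    searched for downstream of the forward primer.
    """
    start = template.find(forward)
    if start == -1:
        return None
    target = reverse_complement(reverse)
    end = template.find(target, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(target)]

template = "TTACGTACGGATCCAAATTTGGGCC"  # made-up template
print(amplicon(template, "ACGTAC", "CCCAAA"))  # ACGTACGGATCCAAATTTGGG
```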
Why search for patterns?
• Pattern matching in sequences is also the basis of searching through a sequence database
– Sequence alignment
Pairwise Sequence Alignment
• An alignment between 2 sequences is a pairwise match between sequences.
• Pairwise sequence comparison is the primary means of linking biological function to the genome and of propagating known information from one genome to another (Gibas & Jambeck).
Why are inexact pattern matches relevant in sequence alignments?
• Sequencing errors
• Mutation
– 2 primary types
• point mutations (affect a single nucleotide)
• segmental mutations (affect a few to hundreds of adjoining nucleotides)
– substitutions (transitions, transversions)
– insertions, deletions
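Substitutions can be classified mechanically: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and a transversion swaps one class for the other. The following minimal sketch counts each kind between two sequences that are assumed to be already aligned, of equal length, and free of gaps; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Classify a pair of aligned bases as match, transition or transversion."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"

def count_substitutions(s1, s2):
    """Tally matches, transitions and transversions between aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for a, b in zip(s1.upper(), s2.upper()):
        counts[classify_substitution(a, b)] += 1
    return counts

print(count_substitutions("ACGT", "GCTT"))
# {'match': 2, 'transition': 1, 'transversion': 1}
```

Insertions and deletions are not handled here, since detecting them requires an alignment step first.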
Mutations
• Point mutations usually arise from a nucleotide mismatch that becomes “fixed” during the process of replication
– Escapes the DNA repair mechanism
• Significant when they occur within a coding region and also cause a change in functionality
– Non-synonymous mutation
– Synonymous mutation: mutated sequence codes for same amino acid as before mutation
– Allowance for synonymous mutation due to wobble and degeneracy of the code
Code Table
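Whether a point mutation in a coding region is synonymous can be checked by translating the codon before and after the change. This sketch assumes the standard genetic code, built compactly from the conventional TCAG ordering, and compares single codons only; CODON_TABLE and classify_codon_change are illustrative names.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest); * = stop.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(before, after):
    """Label a codon change as identical, synonymous or non-synonymous."""
    if before == after:
        return "identical"
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "non-synonymous"

print(classify_codon_change("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_codon_change("GAG", "GTG"))  # non-synonymous (Glu -> Val)
```

The first example changes only the third (wobble) position, which is where most of the code's degeneracy lies.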
Evolutionary Considerations
• Through time mutations tend to be preserved if they are not deleterious
• Functionally important sequences tend to be conserved
• Non-functional or non-coding sequences diverge at a high rate
Evolutionary Considerations
• The tendency of functionally important sequences to remain relatively unchanged over time is the basis for sequence analysis
– Allows us to draw evolutionary connections among genes that are related in sequence