Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the...

Pattern Matching

Rhys Price Jones

Anne R. Haake

What is pattern matching?

• Pattern matching is the procedure of scanning a nucleic acid or protein sequence for matches to short sequence patterns (Staden 1990).

Why search for patterns?

• Usually the sequences of interest (the query sequences) are known to be indicators of some important biological function

• Search for patterns in nucleotide sequence: DNA or RNA

• Search for patterns in amino acid sequence

Motif

• Multiple uses of the word

• Def: a pattern; typically used to refer to a short (up to ten bases or residues) repeated or conserved pattern in nucleic acids or proteins

• Def: a short conserved sequence in a protein; usually associated with function

– In a broader sense, motif is used for all localized regions of homology, regardless of size

Some examples of patterns in DNA sequence:

• Restriction sites: recognition sites for the restriction endonucleases

• Intron splice sites

• Codons specifying ORFs

• Promoters

• DNA binding sites for regulatory proteins

Restriction Sites

• Why identify them?

• Exact or inexact matches?

• Examples:

Restriction sites
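As a rough illustration of exact matching for restriction sites, the sketch below scans a DNA string for a few well-known recognition sequences. The three-enzyme table, the function name `find_sites` and the test sequence are assumptions chosen for illustration; they do not come from the slides.

```python
# Illustrative table of recognition sites (an assumed, tiny subset of known enzymes).
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, sites=RESTRICTION_SITES):
    """Return {enzyme: [0-based start positions]} for every exact match."""
    sequence = sequence.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            # Step forward by one base so overlapping occurrences are also found.
            start = sequence.find(site, start + 1)
        hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    dna = "ttgaattcaaggatccgaattc"  # hypothetical test sequence
    print(find_sites(dna))
```

Because each site is a fixed string, plain substring search is enough here; no mismatches are tolerated.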

Splice Sites

• Splice donor and splice acceptor sites are consensus sequences

– A statistical determination of the pattern; approximates the pattern

• C(or A)AG/GTA(or G)AGT "donor" splice site

• T(or C)nNC(or T)AG/G "acceptor" splice site

• Splice site example
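The two consensus patterns above can be written as regular expressions, as in the sketch below. This is only a strict-consensus scan under stated assumptions: the `n` repeat in the acceptor pattern is read here as "one or more pyrimidines", and the function names are hypothetical. Real splice sites often deviate from the consensus, so a strict scan like this will miss many of them.

```python
import re

# Donor: C(or A)AG / GTA(or G)AGT. The lookahead keeps the match on the exon side,
# so m.end() is the index of the first intron base (the G of GT).
DONOR = re.compile(r"[CA]AG(?=GT[AG]AGT)")

# Acceptor: T(or C)n N C(or T)AG / G. Assumption: "n" means one or more pyrimidines.
# m.end() is the index of the first exon base after the AG.
ACCEPTOR = re.compile(r"[TC]+[ACGT][CT]AG(?=G)")


def donor_boundaries(sequence):
    """0-based positions of the exon/intron boundary for strict donor matches."""
    return [m.end() for m in DONOR.finditer(sequence.upper())]


def acceptor_boundaries(sequence):
    """0-based positions of the intron/exon boundary for strict acceptor matches."""
    return [m.end() for m in ACCEPTOR.finditer(sequence.upper())]


if __name__ == "__main__":
    print(donor_boundaries("TTCAGGTAAGTCC"))     # hypothetical sequence
    print(acceptor_boundaries("TTTTCCTACAGGA"))  # hypothetical sequence
```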

Splice Sites

• Remember that they are consensus sequences

• Why are splice sites of interest?

– Gene finding

– Mutations in the consensus sequence at the splice junctions are common in many inherited disorders

• Ex: thalassemias, muscular dystrophy, Tay-Sachs, neurofibromatosis, Darier’s disease...

• One of the thalassemias: mutation at splice acceptor

YYYNCAG| normal

YYYNCGG| mutant
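The normal and mutant acceptor patterns use IUPAC ambiguity codes (Y = C or T, N = any base). A minimal sketch of checking a site against such a pattern, position by position, might look like the following. The function name and the two example sites are hypothetical, and only the codes used on these slides are included in the table.

```python
# Subset of IUPAC nucleotide codes, mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "N": "ACGT",
}


def matches_consensus(site, consensus):
    """True if every base of `site` is allowed by the IUPAC `consensus`."""
    if len(site) != len(consensus):
        return False
    return all(base in IUPAC[code] for base, code in zip(site.upper(), consensus))


if __name__ == "__main__":
    normal_site = "TCTGCAG"  # hypothetical sequence fitting YYYNCAG
    mutant_site = "TCTGCGG"  # same site with the A -> G change
    print(matches_consensus(normal_site, "YYYNCAG"))
    print(matches_consensus(mutant_site, "YYYNCAG"))
```

A single-base change in the invariant AG is enough to make the site fail the normal pattern, which is the point of the thalassemia example.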

Codons Specifying ORFs

• ORFs (open reading frames)

• Start codon, followed by 60-100 amino acids (a.a.) with no stop codon

• Prokaryotic start codons: usually ATG, GTG or TTG, but this is species specific

• Eukaryotic start: ATG

Code table

• More on this, too, when we discuss gene finding
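The ORF definition above (a start codon followed by a long run of codons with no stop) can be sketched as a simple scan. The version below is an assumption-laden illustration: it looks only at the three forward-strand frames, uses the standard stop codons TAA, TAG and TGA, and takes the minimum length as a parameter (`min_codons`, a hypothetical name) rather than fixing it.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, start_codons=("ATG",), min_codons=60):
    """Return (start, end) index pairs of ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the start codon but not the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # first start codon seen since the last stop in this frame
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon in start_codons:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"  # hypothetical toy sequence
    print(find_orfs(toy, min_codons=5))
```

For a prokaryotic sequence one could pass `start_codons=("ATG", "GTG", "TTG")`; a fuller scan would also cover the reverse complement strand.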

Promoters

• Prokaryotic promoters: consensus sequences

– TTGACA (-35) ---- 17±1 ---- TATAAT (-10)

• Eukaryotic promoters

– TATA box at -25 relative to transcriptional start site

• consensus is 5’-TATAWAW-3’ (W = A or T)

– Initiator sequence (Inr)

• consensus is 5’-YYCARR-3’ (Y is C or T; R is G or A)

• the +1 nucleotide (start) is usually the A of the Inr sequence

• Bind basal transcription factors

– We’ll revisit this when we discuss gene finding
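The spaced prokaryotic consensus (a -35 box, a gap of 17±1 bases, then a -10 box) is a pattern with a variable-length spacer, which a regular expression can express directly. The sketch below assumes exact consensus boxes, which real promoters rarely have, and the names are hypothetical.

```python
import re

# -35 box, a spacer of 16 to 18 bases (17 +/- 1), then the -10 box.
PROKARYOTIC_PROMOTER = re.compile(r"TTGACA[ACGT]{16,18}TATAAT")

# Eukaryotic consensus patterns with the ambiguity codes expanded:
# TATAWAW (W = A or T) and YYCARR (Y = C or T; R = A or G).
TATA_BOX = re.compile(r"TATA[AT]A[AT]")
INITIATOR = re.compile(r"[CT][CT]CA[AG][AG]")


def find_promoters(sequence):
    """Return (start, end) spans of strict -35/-10 consensus matches."""
    return [(m.start(), m.end())
            for m in PROKARYOTIC_PROMOTER.finditer(sequence.upper())]


if __name__ == "__main__":
    toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"  # hypothetical sequence
    print(find_promoters(toy))
```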

Transcription Factor Binding Sites

• Regulatory transcription factors are sequence-specific DNA-binding proteins; sites are often found in or near gene promoter regions

• DNA sequence is called the response element

• What are the DNA sequences like?

Response elements

Some examples of patterns in protein sequences (motifs):

• Prediction of secondary and tertiary structure

– e.g. transcription factors

• helix-turn-helix, b-zip, zinc-finger

Examples

• Presence of active sites of enzymes

• Presence of cell localization signals

Exact vs Inexact (Approximate) Pattern Matching

• Exact Pattern Matching

– Limited use in bioinformatics

– Well-known algorithms (last week)

– A common use of exact pattern matching is to compare a sequence against a large number of possible known patterns such as in the identification of restriction sites

• Approximate

– Most of the other examples of pattern matching in bioinformatics
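To make the exact/approximate distinction concrete, the sketch below slides a pattern along a sequence and reports every window that differs from it by at most a given number of substitutions (Hamming distance). This is one simple notion of approximate matching, chosen here as an assumption; it does not handle insertions or deletions, and the function name is hypothetical.

```python
def approximate_matches(sequence, pattern, max_mismatches=1):
    """Return (position, mismatches) for each window within `max_mismatches`.

    Only substitutions are counted; setting max_mismatches=0 gives exact matching.
    """
    hits = []
    m = len(pattern)
    for i in range(len(sequence) - m + 1):
        window = sequence[i:i + m]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


if __name__ == "__main__":
    seq = "GGTATAATCCTACAATGG"  # hypothetical sequence
    print(approximate_matches(seq, "TATAAT", max_mismatches=1))
```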

Other uses of exact pattern matching?

• Check PCR primers?

• Annotation? (text matching)
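One way to read the primer question: a forward primer should occur exactly in the template, and a reverse primer should occur as its reverse complement. The sketch below checks only that, under the assumption that exact matching is what is wanted; the function names and sequences are hypothetical, and real primer checks also consider melting temperature and off-target sites.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_sites(template, forward, reverse):
    """Return (forward_pos, reverse_pos) as 0-based indices, or -1 if absent.

    The reverse primer is searched for as its reverse complement, because it
    anneals to the strand opposite the one given in `template`.
    """
    template = template.upper()
    return (template.find(forward.upper()),
            template.find(reverse_complement(reverse)))


if __name__ == "__main__":
    print(primer_sites("ACGTTGCAAAGGCCTT", "ACGT", "AAGG"))  # hypothetical inputs
```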

Why search for patterns?

• Pattern matching in sequences is also the basis of searching through a sequence database

– Sequence alignment

Pairwise Sequence Alignment

• An alignment between 2 sequences is a pairwise match between sequences.

• Pairwise sequence comparison is the primary means of linking biological function to the genome and of propagating known information from one genome to another (Gibas & Jambeck).
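The slides do not name an alignment algorithm at this point. As one possible illustration, the sketch below computes the score of the best global alignment of two sequences with the Needleman-Wunsch dynamic programming recurrence. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and only the score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of `a` and `b` (Needleman-Wunsch).

    Keeps only two rows of the dynamic programming table at a time.
    """
    # Row 0: aligning the empty prefix of `a` against each prefix of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of `a` aligned against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in `b`
            left = curr[j - 1] + gap  # gap in `a`
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))  # hypothetical inputs
```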

Why are inexact pattern matches relevant in sequence alignments?

• Sequencing errors

• Mutation

– 2 primary types

• point mutations (affect a single nucleotide)

• segmental mutations (affect a few to hundreds of adjoining nucleotides)

– substitutions (transitions, transversions)

– insertions, deletions
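Substitutions split into transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse). A small sketch of classifying them between two already-aligned sequences of equal length follows; the function names and example sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'none', 'transition' or 'transversion' for a single base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "none"
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(a, b):
    """Count transitions and transversions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned and of equal length")
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(a, b):
        kind = classify_substitution(x, y)
        if kind != "none":
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    print(count_substitutions("ACGT", "GCTT"))  # hypothetical aligned pair
```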

Mutations

• Point mutations usually occur from a nucleotide mismatch that becomes “fixed” during the process of replication

– Escapes the DNA repair mechanism

• Significant when they occur within a coding region and also cause a change in functionality

– Non-synonymous mutation

– Synonymous mutation: mutated sequence codes for same amino acid as before mutation

– Allowance for synonymous mutation due to wobble and degeneracy of the code

Code Table
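Whether a point mutation is synonymous can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code from its usual compact listing (bases in the order T, C, A, G) and compares the two amino acids. The function name and the example codons are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, listed with codon positions cycling through T, C, A, G
# (third position fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon, mutant):
    """Return 'synonymous' if both codons encode the same amino acid."""
    before = CODON_TABLE[codon.upper()]
    after = CODON_TABLE[mutant.upper()]
    return "synonymous" if before == after else "non-synonymous"


if __name__ == "__main__":
    print(classify_point_mutation("CTT", "CTC"))  # third-position change, Leu -> Leu
    print(classify_point_mutation("GAG", "GTG"))  # second-position change, Glu -> Val
```

Third-position changes are often synonymous because of the degeneracy noted above, while first- and second-position changes usually alter the amino acid.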

Evolutionary Considerations

• Through time mutations tend to be preserved if they are not deleterious

• Functionally important sequences tend to be conserved

• Non-functional or non-coding sequences diverge at a high rate

Evolutionary Considerations

• The tendency of functionally important sequences to remain relatively unchanged over time is the basis for sequence analysis

– Allows us to draw evolutionary connections among genes that are related in sequence