Introduction to
Bioinformatics
Introduction to Bioinformatics.
LECTURE 2: GENE FINDING
* Chapter 2: All the sequence's men
2.1 Human genome sweepstake
In 2003 Lee Rowen (Institute for Systems Biology, Seattle) won GeneSweep, the betting pool on the number of human genes.
Her prize: $1200 and a signed copy of Watson’s The Double Helix
Her guess: 25,947 genes
Total bets in the GeneSweep:
2000: $1
2001: $5
2002: $20
2003: $1200
The human genome counts 3.3 billion bp …
… but how do we estimate the number of human genes?
• 1990: estimate human genes ~300,000
• 1995: estimate human genes ~100,000
• 2000: estimate human genes ~30,000
• 2004: estimate human genes ~25,000
• 2008: estimate human genes ~22,000
• 2009: known human genes: 18,308
The pie shows that we’re now down to just 18,308 genes. That’s over 8,000 genes fewer than six years ago.
Many sequences that once looked like full-fledged genes, capable of generating a protein, now don’t make the grade. Some genes turned out to be pseudogenes – vestiges of genes that once worked but have been since wrecked by mutations. In other cases, DNA segments that appeared to be parts of separate genes have turned out to be part of the same gene.
In this lecture we will try to estimate the number of genes in a given DNA string.
First, however, some biology …
The human genome is stored on 23 chromosome pairs. 22 of these are autosomal chromosome pairs, while the remaining pair is sex-determining.
The haploid human genome occupies a total of just over 3 billion DNA base pairs. The Human Genome Project produced a reference sequence of the euchromatic human genome, which is used worldwide in biomedical sciences.
The haploid human genome contains an estimated 22,000 protein-coding genes, far fewer than had been expected before its sequencing. In fact, only about 1.5% of the genome codes for proteins, while the rest consists of RNA genes, regulatory sequences, introns and (controversially) "junk" DNA.
2.2 Genes and Proteins
CENTRAL IDEA:
• Genes code for proteins
• There are fixed codes for START and STOP
• We can use those to look for DNA words:
  [ START | n × <triplet> | STOP ]
• Such DNA words are s
DRAWBACKS:
• Only candidate genes are found
• Most of the DNA is non-coding “junk DNA” (?)
• Where to start reading … and in what direction?
• Very long computation times
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions specifying the biological development of all cellular forms of life (and most viruses).
DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code, a triplet code of nucleotides.
DNA under electron microscope
3D model of a section of the DNA molecule
Genetic code
The genetic code is a set of rules that maps DNA sequences to proteins in the living cell, and is employed in the process of protein synthesis.
Nearly all living things use the same genetic code, called the standard genetic code, although a few organisms use minor variations of the standard code.
Fundamental code in DNA: $\{x_i \mid i = 1, \dots, N,\ x_i \in \{C, A, T, G\}\}$
Human: N = 3.3 billion
Replication of DNA
Genetic code: TRANSCRIPTION
DNA → RNA
Transcription is the process through which a DNA sequence is enzymatically copied by an RNA polymerase to produce a complementary RNA. Or, in other words, the transfer of genetic information from DNA into RNA. In the case of protein-encoding DNA, transcription is the beginning of the process that ultimately leads to the translation of the genetic code (via the mRNA intermediate) into a functional peptide or protein. Transcription has some proofreading mechanisms, but they are fewer and less effective than the controls for DNA; therefore, transcription has a lower copying fidelity than DNA replication.
Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e. the template strand is read in the 3' → 5' direction and the new, complementary RNA is generated in the 5' → 3' direction).
In RNA, thymine (T) is replaced by uracil (U).
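The T → U substitution and the 5' → 3' reading convention can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names (`reverse_complement`, `transcribe`, `transcribe_from_template`) and the toy sequence are assumptions chosen for the example.

```python
# Complement table for DNA bases (illustrative helper, not from the slides).
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written 5' -> 3'."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]

def transcribe(coding_strand: str) -> str:
    """mRNA for a coding strand given 5' -> 3': same letters, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

def transcribe_from_template(template_strand: str) -> str:
    """mRNA built from the template strand (given 5' -> 3').

    The polymerase reads the template 3' -> 5' and synthesises RNA 5' -> 3',
    so the mRNA is the reverse complement of the template with T replaced by U.
    """
    return transcribe(reverse_complement(template_strand))

coding = "ATGGCCTAA"                      # made-up toy sequence
template = reverse_complement(coding)
print(transcribe(coding))
print(transcribe_from_template(template))
```

Both calls print the same mRNA, which is the point of the convention: the transcript looks like the coding strand with U in place of T.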
Directionality: 5' to 3' direction
Directionality, in molecular biology, refers to the end-to-end chemical orientation of a single strand of nucleic acid. The chemical convention of naming carbon atoms in the nucleotide sugar-ring numerically gives rise to a 5' end and a 3' end (usually pronounced "five prime end" and "three prime end"). The relative positions of structures along a strand of nucleic acid, including genes, transcription factors, and polymerases are usually noted as being either upstream (towards the 5' end) or downstream (towards the 3' end).
The importance of having this naming convention lies in the fact that nucleic acids can only be synthesized in vivo in a 5' to 3' direction, as the polymerase used to assemble new strands must attach a new nucleotide to the 3' hydroxyl (-OH) group via a phosphodiester bond. By convention, single strands of DNA and RNA sequences are written in 5' to 3' direction.
Genetic code: TRANSLATION
RNA → protein
Genetic code: exons/introns
Genetic code: TRANSLATION
DNA-triplet → RNA-triplet = codon → amino acid

RNA codon table
There are 20 standard amino acids used in proteins; here are some of the RNA codons that code for each amino acid.

| Amino acid | Letter | RNA codons |
|---|---|---|
| Ala | A | GCU, GCC, GCA, GCG |
| Leu | L | UUA, UUG, CUU, CUC, CUA, CUG |
| Arg | R | CGU, CGC, CGA, CGG, AGA, AGG |
| Lys | K | AAA, AAG |
| Asn | N | AAU, AAC |
| Met | M | AUG |
| Asp | D | GAU, GAC |
| Phe | F | UUU, UUC |
| Cys | C | UGU, UGC |
| Pro | P | CCU, CCC, CCA, CCG |
| ... | | |
| Start | | AUG, GUG |
| Stop | | UAG, UGA, UAA |
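A codon table like this one maps naturally onto a dictionary lookup. The sketch below uses only the entries listed above, so it is deliberately partial; the function name `translate`, the `?` placeholder for codons missing from the table, and the toy mRNA are assumptions made for illustration.

```python
# Partial RNA codon table: only the amino acids listed in the table above.
CODON_TABLE = {}
for amino_acid, codons in {
    "A": "GCU GCC GCA GCG",
    "L": "UUA UUG CUU CUC CUA CUG",
    "R": "CGU CGC CGA CGG AGA AGG",
    "K": "AAA AAG",
    "N": "AAU AAC",
    "M": "AUG",
    "D": "GAU GAC",
    "F": "UUU UUC",
    "C": "UGU UGC",
    "P": "CCU CCC CCA CCG",
}.items():
    for codon in codons.split():
        CODON_TABLE[codon] = amino_acid

STOP_CODONS = {"UAG", "UGA", "UAA"}

def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon is met.

    Codons that are missing from the partial table are rendered as '?'.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TABLE.get(codon, "?"))
    return "".join(protein)

print(translate("AUGGCCAAAUUUUAA"))  # made-up toy mRNA
```

A full implementation would carry all 64 codons; the lookup logic stays the same.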
Protein Structure: primary structure
Protein Structure: secondary structure
a: alpha-helix, b: beta-sheet
Protein Structure: super-secondary Structure
Protein Structure = protein function:
Standard Genetic Code
Note: RNA ‘U’ corresponds to DNA ‘T’ (and pairs with ‘A’ on the template strand)
intron - exon
2.3 Gene annotation: gene finding
• Statistical analysis (e.g. GC-content) can identify different regions on a DNA strand
• Ab initio methods (= statistical analysis)
• Markov sequence model
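As an illustration of the GC-content statistic mentioned in the first bullet, the sketch below computes the GC fraction in sliding windows. The window and step sizes, the function names and the toy sequence are all assumptions for the example; a jump in the windowed value hints at a boundary between regions of different composition.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC content of each full window, as (start position, GC fraction)."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Made-up toy sequence: an AT-rich half followed by a GC-rich half.
toy = "ATATATATAT" + "GCGCGCGCGC"
for start, gc in windowed_gc(toy, window=10, step=5):
    print(start, gc)
```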
Change points in lambda phage
• Ab initio methods suffice for finding genes on prokaryotic DNA
• For more complex eukaryotic DNA we need sequence alignment methods and Markov sequence models.
READING FRAMES
The DNA is translated per codon (= nucleotide triplet).
The sequence: …ACGTACGTACGTACGTACGT…
can thus be read as:
…-ACG-TAC-GTA-CGT-ACG-TAC-GT…
or: …A-CGT-ACG-TAC-GTA-CGT-ACG-T…
or: …AC-GTA-CGT-ACG-TAC-GTA-CGT-…
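The three readings above differ only in the offset at which the first triplet starts. A minimal sketch, with the helper name `codons_in_frame` chosen for illustration, reproduces them; incomplete triplets at the ends are dropped.

```python
def codons_in_frame(seq: str, frame: int) -> list[str]:
    """Split seq into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ACGTACGTACGTACGTACGT"
for frame in range(3):
    print(frame, "-".join(codons_in_frame(seq, frame)))
```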
OPEN READING FRAMES: ORF
An open reading frame or ORF is a portion of an organism's genome which contains a sequence of bases that could potentially encode a protein.
In a gene, ORFs are located between the start-code sequence (initiation codon) and the stop-code sequence (termination codon).
As we saw, we can distinguish 3 possible reading frames on one strand (5’ to 3’).
On the complementary strand (also read 5’ to 3’) there are 3 more possibilities, but these can be reconstructed from the first strand via its reverse complement.
So we can distinguish 6 possible reading frames in which to look for ORFs.
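The six frames can be enumerated from a single strand by taking its reverse complement. This is a sketch under the assumption that the input contains only A, C, G and T; the labels `+1` … `-3` and the function names are illustrative conventions, not notation from the lecture.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5' -> 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq: str) -> dict[str, list[str]]:
    """Codon lists for the 3 forward frames (+1..+3) and 3 reverse frames (-1..-3)."""
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)
            ]
    return frames

for name, codons in six_frames("ATGAAATAG").items():  # made-up toy sequence
    print(name, codons)
```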
Introns and Exons
Algorithm 2.1: ORF-finder
Given a DNA sequence s and a positive integer k, for each possible reading frame decompose the sequence into triplets, and find all stretches of triplets starting with a START codon and ending with a STOP codon.
Repeat for the reverse complement of s.
The output consists of all ORFs longer than or equal to the prefixed threshold k.
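One possible reading of Algorithm 2.1 in Python is sketched below. Several choices are assumptions rather than part of the algorithm statement: ATG is the only START codon, the threshold k is measured in codons and includes START and STOP, and for each STOP only the ORF beginning at the first in-frame START is reported. Coordinates on the `-` strand refer to the reverse complement.

```python
START = "ATG"                      # assumption: ATG is the only start codon
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, written 5' -> 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(strand, k):
    """Yield (start, end, orf) for ORFs of at least k codons on one strand.

    Length is counted in codons and includes START and STOP (an assumption).
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if start is None and codon == START:
                start = i                      # first in-frame START opens an ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= k:
                    yield start, i + 3, strand[start:i + 3]
                start = None                   # look for the next START

def find_orfs(s, k):
    """All ORFs of at least k codons in the six reading frames of s."""
    s = s.upper()
    results = []
    for name, strand in (("+", s), ("-", reverse_complement(s))):
        for start, end, orf in orfs_in_strand(strand, k):
            results.append((name, start, end, orf))
    return results

print(find_orfs("CCATGAAATTTTAGCC", k=3))  # made-up toy sequence
```

Raising k filters out short ORFs, which is the role the threshold plays in the significance discussion of Section 2.4.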
2.4 Detecting spurious signals
• a pattern in DNA can arise from pure chance
• hypothesis testing with null-hypothesis H0
• test statistics
• p-value = probability-value
• significance level
• type I-error (FP = False Positive) of H0
• type II-error (FN = False Negative) of H0
Hypothesis testing with H0
Computing a p-value for ORFs
• Translation table: triplet → amino acid (AA)
• 64 possible triplets (for 20 AAs)
• 1 start-codon ATG = M = Met = Methionine
• 3 stop-codons TAA, TAG, and TGA
• a priori probability non stop-codon = 61/64
• a priori probability non-stop codon = 61/64
• P(k non-stop codons) = $(61/64)^k$
• 95% significance: p = 0.05
• $(61/64)^k \approx 0.05 \Rightarrow k \approx 62$, i.e. about 62 to 64 codons
• 99% significance: p = 0.01
• $(61/64)^k \approx 0.01 \Rightarrow k \approx 96$, i.e. roughly 100 codons
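The thresholds follow from solving $(61/64)^k = \alpha$ for k, which gives $k = \ln\alpha / \ln(61/64)$. The short check below assumes the uniform codon model of this slide; the function names are illustrative.

```python
import math

P_NON_STOP = 61 / 64  # probability that a uniformly random codon is not a stop

def p_value(k: int) -> float:
    """Probability of k consecutive non-stop codons under the uniform null model."""
    return P_NON_STOP ** k

def min_length(alpha: float) -> float:
    """Solve (61/64)^k = alpha for k."""
    return math.log(alpha) / math.log(P_NON_STOP)

for alpha in (0.05, 0.01):
    print(alpha, round(min_length(alpha), 1))
```

This evaluates to about 62.4 codons for α = 0.05 and about 95.9 codons for α = 0.01.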
Non-uniform codon distribution
• $P_{stop} = P(TAA) + P(TAG) + P(TGA)$
• $P(k \text{ non-stop codons}) = (1 - P_{stop})^k$
• For a significance level $\alpha$ we need $k^*$ codons with: $(1 - P_{stop})^{k^*} = \alpha$
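How $P_{stop}$ is estimated is not specified here. The sketch below makes one possible assumption: nucleotides are independent, so each stop-codon probability is the product of the observed base frequencies. Counting codon frequencies directly would be an equally valid alternative. Function names are illustrative.

```python
import math
from collections import Counter

STOPS = ("TAA", "TAG", "TGA")

def stop_probability(seq: str) -> float:
    """P_stop under an independent-nucleotide model fitted to seq (an assumption)."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    freq = {b: counts[b] / total for b in "ACGT"}
    return sum(freq[c[0]] * freq[c[1]] * freq[c[2]] for c in STOPS)

def critical_length(p_stop: float, alpha: float) -> float:
    """Solve (1 - p_stop)^k = alpha for k."""
    return math.log(alpha) / math.log(1.0 - p_stop)

# With equal base frequencies this reduces to the uniform case, P_stop = 3/64.
p_uniform = stop_probability("ACGT" * 10)
print(p_uniform, critical_length(p_uniform, 0.05))
```

Under this model an AT-rich genome has a higher $P_{stop}$ (all three stop codons are AT-rich), so a shorter ORF already reaches significance.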
Randomization tests
• Generate a string with the same statistical properties as the original data
• Per nucleotide? per triplet? per … ?
• p-value: find the rank of observed test statistic in null distribution: if its percentile is less than α then it is significant
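A single-nucleotide permutation test along these lines might look as follows. The test statistic (longest run of non-stop codons in frame 0), the number of permutations, and the `(hits + 1) / (n_perm + 1)` estimator are assumptions chosen to keep the sketch short; any other statistic could be substituted.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}

def longest_stop_free_run(seq: str) -> int:
    """Test statistic: longest run of non-stop codons in frame 0, in codons."""
    best = run = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

def permutation_p_value(seq: str, n_perm: int = 1000, seed: int = 0) -> float:
    """Share of nucleotide-shuffled sequences whose statistic is >= the observed one."""
    rng = random.Random(seed)
    observed = longest_stop_free_run(seq)
    bases = list(seq)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(bases)  # permuting keeps the base composition exactly
        if longest_stop_free_run("".join(bases)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

print(permutation_p_value("ATGAAATAGCCC" * 3, n_perm=200))  # made-up toy input
```

The result is compared with α in the usual way: a small value means the observed statistic sits in the upper tail of the null distribution.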
• Another method is bootstrapping
• No permutation but sampling with replacement
• Again: per nucleotide? per triplet? per … ?
• p-value: find the rank of observed test statistic in null distribution: if its percentile is less than α then it is significant
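The difference between permuting and bootstrapping, and the choice of resampling unit, can be made concrete with a small sketch. The helper names and the `unit` parameter (1 for nucleotides, 3 for triplets) are assumptions for illustration.

```python
import random

def split_units(seq: str, unit: int) -> list[str]:
    """Cut seq into consecutive chunks of `unit` bases (1 = nucleotide, 3 = triplet)."""
    return [seq[i:i + unit] for i in range(0, len(seq) - unit + 1, unit)]

def permuted(seq: str, unit: int, rng: random.Random) -> str:
    """Permutation: every chunk of the original is used exactly once."""
    chunks = split_units(seq, unit)
    rng.shuffle(chunks)
    return "".join(chunks)

def bootstrapped(seq: str, unit: int, rng: random.Random) -> str:
    """Bootstrap: chunks are drawn with replacement, so some repeat and some drop out."""
    chunks = split_units(seq, unit)
    return "".join(rng.choices(chunks, k=len(chunks)))

rng = random.Random(1)
seq = "ATGAAATAGCCC"  # made-up toy sequence
print(permuted(seq, 3, rng))
print(bootstrapped(seq, 3, rng))
```

Resampling per triplet preserves codon frequencies, while resampling per nucleotide preserves only base composition, which is why the choice of unit changes the null model.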
Example: ORF length in Mycoplasma genitalium
• original DNA sequence: 11,922 ORFs
• single-nucleotide permutation test = multinomial distribution
• permute, search ORFs, record their length
• randomized DNA sequence: 17,367 ORFs
• H0 = randomized DNA sequence
ORF length in Mycoplasma genitalium
• This approach does not identify short genes
• Smaller threshold for ORF-length: the upper 5% of randomized DNA
• In original DNA 1520 ORFs in this upper 5%
• Many FALSE POSITIVES but still better than the original 11,922 in the DNA of M.gen.
• H0 = randomized DNA sequence
• Keep ORFs in original DNA that are longer than (most) ORFs in the randomized DNA
• max(ORF-length) in random seq. = 402 bp
• in original DNA 326 ORFs longer than 402 bp
• Good estimate: M. genitalium has 470 genes
Here follows the ORF-length distribution in the original and the randomized DNA
* Note the long tail of the real DNA! *
Example 2: ORF length in Haemophilus influenzae
• threshold = max(ORF-length) in random seq. = 537 bp
• in original DNA 1182 ORFs longer than 537 bp
• this is about the real number of genes: 1428
Problems with multiple testing
• the α-significance (e.g. 5%) represents the false positive rate of one single test
• If we conduct – say – 100 tests this means that 5 false positives are expected
• therefore, if 5 significant genes were found out of 100 tests with α = 0.05 this does not mean anything biologically!
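The arithmetic behind these bullets is short enough to check directly. The second function additionally assumes the tests are independent, which the slide does not state; function names are illustrative.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests

def prob_at_least_one_false_positive(alpha: float, n_tests: int) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(expected_false_positives(0.05, 100))
print(prob_at_least_one_false_positive(0.05, 100))
```

With α = 0.05 and 100 tests, about 5 false positives are expected, and under independence at least one false positive is almost certain.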
• The ORFs found in this way are only candidate genes!
• It is not clear at this point whether these ORFs are actually translated into proteins!
• Using sequence alignment (Chapter 3) the case for a candidate gene can be tightened
END of LECTURE 2