Post on 05-Jan-2016
Bio-Medical Informatics
Instructor: Hanif Yaghoobi
Website: site444703.44.webydo.com
E-mail: Hyiautcourse@gmail.com
My personal mail: hanifeyaghoobi@gmail.com
About this Course
• Activities during the semester (5 points):
  1) Homework
  2) MATLAB exercises
• Final project (3 points)
• Final exam (12 points)
Shortliffe
“ Medical informatics is the rapidly developing scientific field that deals with resources, devices and formalized methods for optimizing the storage, retrieval and management of biomedical information for problem solving and decision making”
Edward Shortliffe, MD, PhD
1995
Organisms
• Classified into two types:
• Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi,…)
• Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria)
• Not all single celled organisms are prokaryotes!
Cells
• Complex system enclosed in a membrane
• Organisms are unicellular (bacteria, baker’s yeast) or multicellular
• Humans:
  – 60 trillion cells
  – 320 cell types
Example Animal Cell (image source: www.ebi.ac.uk/microarray/biology_intro.htm)
DNA Basics – cont.
• DNA in Eukaryotes is organized in chromosomes.
Chromosomes
• In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes
• Humans:
  – 22 pairs of autosomes
  – 1 pair of sex chromosomes
Human Karyotype: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html
What is DNA?
• DNA: Deoxyribonucleic Acid
• Each strand is a polynucleotide (oligomer): a chain of nucleotides. Two complementary strands pair to form the double-stranded molecule
• 4 different nucleotides, distinguished by their bases:
  – Adenine (A)
  – Cytosine (C)
  – Guanine (G)
  – Thymine (T)
Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• Difference is in base structure
Image source: www.ebi.ac.uk/microarray/biology_intro.htm
DNA
The Central Dogma: Protein Synthesis
[Diagram: Genome -> Transcriptome -> Proteome -> Cell Function, via transcription and translation; labelled "Gene Expression Level"]
Genome
• chromosomal DNA of an organism
• The number of chromosomes and the genome size vary significantly from one organism to another
• Genome size and number of genes do not necessarily determine organism complexity
Genome Comparison
ORGANISM                             CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                23            3,200,000,000      ~30,000
Mus musculus (Mouse)                 20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)  4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)     16            14,000,000         ~6,000
Zea mays (Corn)                      10            2,400,000,000      ???
DNA Basics – cont.
• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g}. (For example: atgatcccaaatggaca…)
DNA Basics – cont.
• In genes (protein-coding regions), when a protein is built from amino acids, the nucleotides (letters) are read as triplets (codons). Every codon specifies one amino acid for protein synthesis (there are 20 amino acids).
DNA Basics – cont.
• There are 6 ways of translating a DNA signal into a codon signal, called the reading frames (3 offsets * 2 strand directions).
…CATTGCCAGT…
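The six reading frames can be sketched in a few lines of Python, using the fragment CATTGCCAGT from the slide. The helper names (`reverse_complement`, `codons`, `six_frames`) are illustrative and not part of the course material; incomplete trailing triplets are simply dropped.

```python
# Illustrative sketch: helper names are assumptions, not course code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def codons(dna, offset):
    """Split dna into complete triplets, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def six_frames(dna):
    """Three offsets on the given strand, then three on the reverse complement."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [codons(strand, off) for strand in (forward, reverse) for off in range(3)]

for frame in six_frames("CATTGCCAGT"):
    print(" ".join(frame))
```

The first three printed lines are the frames of the strand as written; the last three are the frames of the complementary strand.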
DNA Basics – Cont.
Start: ATG
Stop: TAA, TGA, TAG
[Figure: a gene within the sequence …CATTGCCAGT…, drawn as exons separated by introns]
Understanding Genome Sequences
~3,289,000,000 characters:
aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga tttaggttgtttccagtttttactggcacagatacggcaatgaatataat tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca
. . .
Goal: Identify components encoded in the DNA sequence
Open Reading Frame
• Protein-encoding DNA sequence consists of a sequence of 3 letter codons
• Starts with the START codon (ATG)
• Ends with a STOP codon (TAA, TAG, or TGA)
ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
M L S V T S . . . Q R STP
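The codon-to-amino-acid reading in the example above can be reproduced with a small lookup table. This is a minimal sketch assuming the standard genetic code; the 64-letter string lists the amino acids for codons in T, C, A, G order, with `*` marking a stop codon. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate dna codon by codon from position 0, stopping after a stop codon ('*')."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError on letters other than A, C, G, T
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)

print(translate("ATGCTCAGCGTGACCTCA"))  # start of the slide example
print(translate("CAGCGTTAA"))           # end of the slide example
```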
Finding Open Reading Frames
Try all possible starting points
• 3 possible offsets
• 2 possible strands
Simple algorithm finds all ORFs in a genome
• Many of these are spurious (are not real genes)
• How do we focus on the real ones?
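A minimal sketch of such a simple ORF scan, assuming an ORF is any in-frame run from ATG to the first following stop codon. The function name `find_orfs`, the `min_codons` parameter and the toy sequence are assumptions for illustration; the toy sequence joins the two visible ends of the slide example and is not real genomic data.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(dna, min_codons=2):
    """Return (strand, start, sequence) for each ATG...stop run in all six frames.

    start is the 0-based position on the strand that was scanned.
    min_codons (hypothetical filter) counts codons including the stop codon.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, seq[start:i + 3]))
                    start = None                   # close it at the first stop
    return orfs

toy = "CC" + "ATGCTCAGCGTGACCTCACAGCGTTAA" + "GG"
for orf in find_orfs(toy):
    print(orf)
```

On a real genome this scan reports many short, spurious ORFs, which is why a length filter alone is not enough to separate real genes from chance hits.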
Using Additional Genomes
Basic premise: “What is important is conserved”
Evolution = Variation + Selection
– Variation is random
– Selection reflects function
Idea:
• Instead of studying a single genome, compare related genomes
• A real open reading frame will be conserved
Phylogenetic Tree of Yeasts (Kellis et al., Nature 2003)
[Figure: phylogenetic tree of S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, C. glabrata, S. castellii, K. lactis, A. gossypii, K. waltii, D. hansenii, C. albicans, Y. lipolytica, N. crassa, M. graminearum, M. grisea, A. nidulans and S. pombe; annotation on the tree: ~10M years]
Evolution of Open Reading Frame
S. cerevisiae  ATGCTCAGCGTGACCTCA . . .
S. paradoxus   ATGCTCAGCGTGACATCA . . .
S. mikatae     ATGCTCAGGGTGACA--A . . .
S. bayanus     ATGCTCAGG---ACA--A . . .
• Conserved positions
• Variable positions
• A deletion
• Frame shift changes interpretation of downstream sequence
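The four aligned sequences can be checked column by column. The sketch below marks each column as conserved (C) or variable (V) and flags gap runs whose length is not a multiple of 3, since those shift the reading frame. The function names are illustrative, and treating any column that contains a gap as variable is an assumption.

```python
import re

ALIGNMENT = {
    "S. cerevisiae": "ATGCTCAGCGTGACCTCA",
    "S. paradoxus":  "ATGCTCAGCGTGACATCA",
    "S. mikatae":    "ATGCTCAGGGTGACA--A",
    "S. bayanus":    "ATGCTCAGG---ACA--A",
}

def column_status(rows):
    """Mark each column 'C' if all species share the same base (no gaps), else 'V'."""
    return "".join(
        "C" if len(set(col)) == 1 and "-" not in col else "V"
        for col in zip(*rows)
    )

def frameshift_gaps(row):
    """Return (start, length) of gap runs whose length is not a multiple of 3."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"-+", row)
            if len(m.group()) % 3 != 0]

print(column_status(ALIGNMENT.values()))
for species, row in ALIGNMENT.items():
    print(species, frameshift_gaps(row))
```

The 3-base gap in S. bayanus removes one whole codon and keeps the frame; the 2-base gaps in S. mikatae and S. bayanus are the ones that change the downstream reading.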
Examples [Kellis et al., Nature 2003]
[Figure: alignments of a spurious ORF and a confirmed ORF, with conserved and variable positions marked; annotations: frame shift, sequencing error, ATG not conserved]
Greedy algorithm to find conserved ORFs is surprisingly effective (> 99% accuracy) on verified yeast data
Defining Conservation
Naïve approach
• Consensus between all species
Problem:
• Coarse-grained
• Ignores distances between species
• Ignores the tree topology
Goal:
• More sensitive and robust methods
[Figure: four example alignment columns across nine species (AAAAAAAAA, AAAACCCCC, ACAGTCGGT, CCCACAAAC), labelled from conserved (100%) to variable]
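The naïve consensus score can be written as the fraction of species that carry the most common base in a column. This is a sketch under that assumption; the four columns are read from the figure residue on this slide, and the function name `conservation` is illustrative.

```python
from collections import Counter

def conservation(column):
    """Fraction of species carrying the most common base in this alignment column."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)

COLUMNS = ["AAAAAAAAA", "AAAACCCCC", "ACAGTCGGT", "CCCACAAAC"]
for col in COLUMNS:
    print(col, round(conservation(col), 2))
```

The score treats all nine species as equally informative: it cannot tell whether the five species sharing a base are close relatives or spread across the tree, which is the weakness the slide points out.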
Bioinformatics – an area of emerging knowledge
• Each cell of the body contains the whole DNA of the individual (about 40,000 genes in the human genome, each comprising from 50 to a million base pairs: A, T, C or G)
• The central dogma of molecular biology: DNA -> RNA -> proteins
• Transcription: DNA (about 5%) -> mRNA
  – DNA -> pre-RNA -> splicing -> mRNA (only the exons)
• Translation: mRNA -> proteins
  – Proteins make cells alive and specialised (e.g. blue eyes)
  – Genome -> proteome
N.Kasabov, 2003
Bioinformatics
• The area of Science that is concerned with the development and applications of methods, tools and systems for storing and processing of biological information to facilitate knowledge discovery.
• Interdisciplinary: Information and computer science, Molecular Biology, Biochemistry, Genetics, Physics, Chemistry, Health and Medicine, Mathematics and Statistics, Engineering, Social Sciences.
• Biology, Medicine -- Information Science --> IT, Clinics, Pharmacy
• Links to Health informatics, Clinical DSS, Pharmaceutical Industry
N.Kasabov, 2003
Bioinformatics: challenging problems for computer and information sciences
• Discovering patterns (features) from DNA and RNA sequences (e.g. genes, promoters, ribosome binding sites (RBS), splice junctions)
• Analysis of gene expression data and predicting protein abundance
• Discovering gene networks: genes that are co-regulated over time
• Protein discovery and protein function analysis
• Predicting the development of an organism from its DNA code (?)
• Modeling the full development (metabolic processes) of a cell (?)
• Implications: health; social,…
N.Kasabov, 2003
Problems in Computational Modeling for Bioinformatics
• Abundance of genome data, RNA data, protein data and metabolic pathway data is now available (see http://www.ncbi.nlm.nih.gov) and this is just the beginning of computational modeling in Bioinformatics
• Complex interactions:
  – between proteins, genes, DNA code
  – between the genome and the environment
  – much yet to be discovered
• Stability and repetitiveness: Genes are relatively stable carriers of information.
• Many sources of uncertainty:
  – Alternative splicing
  – Mutation in genes caused by: ionising radiation (e.g. X-rays), chemical contamination, replication errors, viruses that insert genes into host cells, aging processes, etc.
  – Mutated genes express differently and cause the production of different proteins
• It is extremely difficult to model dynamic, evolving processes
Bioinformatics: Important Challenges
• Gene prediction
• Gene function
• Protein function
• Protein 3D structure
[Diagram: DNA sequence {A,T,C,G} -> (transcription) -> gene expression level (microarray) -> (translation) -> protein sequence (e.g. KMLSLLMARTYW); public databases]
Gene Expression
Microarray
• What can it be used for?
• How does it work?
• What are the advantages?
An Example Application
Microarrays can be used for comparing transcription levels between two cells.
Examples:
• Comparison between cells from a young mouse and cells from an old mouse
• Drug efficacy: treated cells vs untreated cells
How it works: based on hybridization
[Figure: a probe anchored on the chip, ACTTGACC, pairs with its complement TGAACTGG (as mRNA: UGAACUGG); a second probe, ACTTAACC, pairs with the mRNA UGAAUUGG. A-T/U pairs are drawn with two bonds (=), C-G pairs with three (≡).]
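Hybridization depends on base pairing, which can be checked position by position. The sketch below counts the positions where an mRNA fails to pair with a DNA probe, using the sequences from the figure. Both strings are written so that position i of the probe faces position i of the mRNA, as drawn on the slide; the function name `mismatches` is illustrative.

```python
# DNA probe base -> RNA base that pairs with it
PAIR = {"A": "U", "C": "G", "G": "C", "T": "A"}

def mismatches(probe, mrna):
    """Count positions where the mRNA does not pair with the DNA probe."""
    if len(probe) != len(mrna):
        raise ValueError("probe and mRNA must have the same length")
    return sum(PAIR[p] != r for p, r in zip(probe.upper(), mrna.upper()))

print(mismatches("ACTTGACC", "UGAACUGG"))  # perfect match
print(mismatches("ACTTGACC", "UGAAUUGG"))  # single-base mismatch
print(mismatches("ACTTAACC", "UGAAUUGG"))  # matches the second probe
```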
Probes and the printing process
[Figure: microtiter plates, print head with pins, slides (100)]
23/2/2008
Microarray Technology
[Figure: probe (on chip), sample (labelled), pseudo-colour image; image from Jeremy Buhler]
Experimental design
• Track what’s on the chip: which spot corresponds to which gene
• Duplicate experimental spots: reproducibility
• Controls
  – DNAs spotted on glass: positive probe (induced or repressed), negative probe (bacterial genes on human chip)
  – oligos on glass or synthesised on chip (Affymetrix): point mutants (hybridisation plus/minus)
Images from scanner
• Resolution
  – standard 10 µm [currently, max 5 µm]
  – 100 µm spot on chip = 10 pixels in diameter
• Image format
  – TIFF (tagged image file format), 16 bit (65,536 levels of grey)
  – 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
  – other formats exist, e.g. SCN (used at Stanford University)
• Separate image for each fluorescent sample
  – channel 1, channel 2, etc.
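The figures on this slide follow from simple arithmetic, sketched here. The function name `image_size_bytes` is illustrative, and 1 MB is taken as 1,000,000 bytes.

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel):
    """Uncompressed size of a scan: pixels across * pixels down * bytes per pixel."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 µm
    pixels_y = round(height_cm * 10_000 / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8

print(image_size_bytes(1, 1, 10, 16))  # bytes for a 1 cm x 1 cm scan at 10 µm, 16 bit
print(100 // 10)                       # pixels across a 100 µm spot at 10 µm resolution
print(2 ** 16)                         # grey levels in a 16-bit image
```

Scanning the same area at 5 µm quadruples the pixel count, so the file size grows by a factor of four.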
Images in analysis software
• The two 16-bit images (cy3, cy5) are compressed into 8-bit images
• Goal: display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image
• RGB image:
  – Blue values (B) are set to 0
  – Red values (R) are used for cy5 intensities
  – Green values (G) are used for cy3 intensities
• Qualitative representation of results
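The overlay construction can be sketched with plain Python lists. Compressing 16 bits to 8 by dropping the low byte is an assumption made here for simplicity; analysis software may apply a different scaling. The function names are illustrative.

```python
def to_8bit(value16):
    """Scale a 16-bit intensity (0..65535) to 8 bits (0..255) by dropping the low byte."""
    return value16 >> 8

def overlay_pixel(cy3, cy5):
    """Return an (R, G, B) pixel: red from cy5, green from cy3, blue fixed at 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)

def overlay(cy3_image, cy5_image):
    """Combine two equally sized 16-bit images (lists of rows) into one RGB image."""
    return [
        [overlay_pixel(g, r) for g, r in zip(row3, row5)]
        for row3, row5 in zip(cy3_image, cy5_image)
    ]

cy3 = [[65535, 0], [65535, 0]]
cy5 = [[0, 65535], [65535, 0]]
print(overlay(cy3, cy5))
```

A spot that is bright in both channels gets high red and high green values, which the eye reads as yellow.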
Images: examples
[Figure panels: cy3, cy5, pseudo-color overlay]
Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed
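The colour table translates directly into a comparison of the two signals. In this sketch the `tolerance` parameter (the relative difference below which the two signals count as equal) is a hypothetical choice, not a value from the slides.

```python
def classify_spot(control, perturbed, tolerance=0.1):
    """Return (spot color, gene expression) from the control and perturbed intensities.

    tolerance is a hypothetical relative threshold for treating signals as equal.
    """
    scale = max(control, perturbed)
    if abs(perturbed - control) <= tolerance * scale:
        return ("yellow", "unchanged")
    if perturbed > control:
        return ("red", "induced")
    return ("green", "repressed")

print(classify_spot(1000, 1000))
print(classify_spot(500, 2000))
print(classify_spot(2000, 500))
```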
Data: DNA Microarray
[Figure: expression of gene 1, gene 2 and gene 3 assayed over time; x-axis: time (min), 0 to 60]
Data Required: Gene Expression Matrix
      t1  t2  t3  t4
g1     0   1   2   1
g2     1   2   1   0
g3     0   1   1   1
g4     1   2   1   0
Data Required: Gene Expression Matrix
      a1  a2  a3  a4
g1     0   3   1   1
g2     1   2   1   0
g3     0   1   1   1
g4     1   2   1   0
Snapshot
      t1  t2  t3  t4
g1     0   1   2   1
g2     1   2   1   0
g3     0   1   1   1
g4     1   2   1   0
Time series
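The gene expression matrix maps naturally onto a list of rows. The sketch below assumes that a snapshot is one column (all genes at one time point) and a time series is one row (one gene across all time points); that reading of the two labels, and the function names, are assumptions.

```python
GENES = ["g1", "g2", "g3", "g4"]
TIMES = ["t1", "t2", "t3", "t4"]
MATRIX = [
    [0, 1, 2, 1],  # g1
    [1, 2, 1, 0],  # g2
    [0, 1, 1, 1],  # g3
    [1, 2, 1, 0],  # g4
]

def snapshot(time_point):
    """Expression of every gene at one time point (one column of the matrix)."""
    col = TIMES.index(time_point)
    return {gene: row[col] for gene, row in zip(GENES, MATRIX)}

def time_series(gene):
    """Expression of one gene across all time points (one row of the matrix)."""
    return MATRIX[GENES.index(gene)]

def identical_profiles():
    """Pairs of genes whose rows are equal, a crude hint of co-regulation."""
    return [
        (GENES[i], GENES[j])
        for i in range(len(GENES))
        for j in range(i + 1, len(GENES))
        if MATRIX[i] == MATRIX[j]
    ]

print(snapshot("t2"))
print(time_series("g1"))
print(identical_profiles())
```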
• World Health Organization