Introduktion til Bioinformatik Hold 01 Oktober 2010.

Transcript of Introduktion til Bioinformatik Hold 01 Oktober 2010.

Page 1: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Introduction to Bioinformatics

Class 01

October 2010

Page 2: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Introduction

Rasmus Wernersson, Associate Professor

Anders Gorm Pedersen, Docent

Center for Biological Sequence Analysis, DTU

Page 3: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Overview

Data & Databases

• Taxonomy

• DNA

• Protein

• Protein structure

Methods

• Alignment

• Pairwise + Multiple

• BLAST (search)

• Phylogenetic trees

• PyMOL (3D visualization)

Wrap-up exercise: Malaria vaccine

Page 4: Introduktion til Bioinformatik Hold 01 Oktober 2010.

The exercises are the main thing

Page 5: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Course plan on our wiki

Page 6: Introduktion til Bioinformatik Hold 01 Oktober 2010.

On evolution and sequences

Background information

Page 7: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Classification: Linnaeus

Carl Linnaeus, 1707-1778

Page 8: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Classification: Linnaeus

• Hierarchical system

– Kingdom

– Phylum

– Class

– Order

– Family

– Genus

– Species

Page 9: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Classification depicted as a tree

Page 10: Introduktion til Bioinformatik Hold 01 Oktober 2010.

No “mixed” animals

Source: www.dr.dk/oline

Page 11: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Classification depicted as a tree

[Figure labels: Species, Genus, Family, Order, Class]

Page 12: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Comparison of limbs

Image source: http://evolution.berkeley.edu

Page 13: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Theory of evolution

Charles Darwin, 1809-1882

Page 14: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Phylogenetic basis of systematics

• Linnaeus: Ordering principle is God.

• Darwin: Ordering principle is shared descent from common ancestors.

• Today, systematics is explicitly based on phylogeny.

Page 15: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Natural Selection: Darwin’s four postulates

• More young are produced each generation than can survive to reproduce.

• Individuals in a population vary in their characteristics.

• Some differences among individuals are based on genetic differences.

• Individuals with favorable characteristics have higher rates of survival and reproduction.

• Evolution by means of natural selection

• Presence of “design-like” features in organisms:

• Quite often features are there “for a reason”

Page 16: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Evolution at the sequence level

Page 17: Introduktion til Bioinformatik Hold 01 Oktober 2010.

About DNA

• DNA contains the recipes for making proteins / enzymes.

• Every time a cell divides, its DNA is duplicated, and each daughter cell gets a copy.
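
To make the "recipe" idea concrete, here is a minimal Python sketch that reads a DNA sequence three letters (one codon) at a time and looks up the corresponding amino acid. It assumes the standard genetic code and a coding sequence written 5’ to 3’; the names `CODON_TABLE` and `translate` are made up for this illustration and do not come from the slides.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate coding DNA codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGCCAGGTAA"))  # MAR*
```

The 12-letter input is the short example sequence used on the strand-direction slide; it reads as Met, Ala, Arg, stop.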

Page 18: Introduktion til Bioinformatik Hold 01 Oktober 2010.

The DNA alphabet

• The information in the DNA is written in a four letter code: A, T, G, C.

• The DNA can be “sequenced” and the result stored in a computer file.

• ATGGCCCTGTGGAT
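
Once a sequence is stored as text, simple questions about it become one-liners. The sketch below counts the four letters in the example sequence above and computes the GC fraction; the function name `base_composition` is an assumption made for this example.

```python
from collections import Counter


def base_composition(seq: str) -> Counter:
    """Count how often each letter occurs in a DNA sequence."""
    return Counter(seq.upper())


seq = "ATGGCCCTGTGGAT"
counts = base_composition(seq)
gc_fraction = (counts["G"] + counts["C"]) / len(seq)

for base in "ATGC":
    print(base, counts[base])
print(f"GC fraction: {gc_fraction:.2f}")
```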

Page 19: Introduktion til Bioinformatik Hold 01 Oktober 2010.

DNA is always written 5’ → 3’

5’ AGCC 3’

3’ TCGG 5’

5’ ATGGCCAGGTAA 3’

DNA backbone: http://en.wikipedia.org/wiki/DNA

(Deoxy)ribose: http://en.wikipedia.org/

[Figure: ribose and deoxyribose with carbon atoms numbered 1 to 5; strand ends labeled 5’ and 3’]
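
The two-strand example on this slide (5’ AGCC 3’ paired with 3’ TCGG 5’) can be reproduced with a short Python sketch. Because sequences are written 5’ to 3’, the partner strand is obtained by complementing every base and then reversing the result. The names `COMPLEMENT` and `reverse_complement` are chosen for this illustration only.

```python
# Watson-Crick pairing: A with T, G with C (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in the 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AGCC"))  # GGCT, i.e. 3' TCGG 5' read from the other end
```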

Page 20: Introduktion til Bioinformatik Hold 01 Oktober 2010.

• ATGGCCCTGTGGATGCG

Can DNA be changed?

Page 21: Introduktion til Bioinformatik Hold 01 Oktober 2010.

• ATGGCCCTGTGGATGCG

• ATGGCCCTATGGATGCG

Can DNA be changed?
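
The change between the two sequences on this slide can be located by comparing them position by position. This is a minimal sketch that assumes both sequences have the same length (substitutions only, no insertions or deletions); `point_differences` is a made-up helper name.

```python
def point_differences(before: str, after: str) -> list[tuple[int, str, str]]:
    """List (1-based position, old base, new base) for every mismatch."""
    if len(before) != len(after):
        raise ValueError("sequences must have the same length")
    return [
        (pos, old, new)
        for pos, (old, new) in enumerate(zip(before, after), start=1)
        if old != new
    ]


print(point_differences("ATGGCCCTGTGGATGCG", "ATGGCCCTATGGATGCG"))
# [(9, 'G', 'A')]
```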

Page 22: Introduktion til Bioinformatik Hold 01 Oktober 2010.

A history of mutations

ATGGCCCTGTGTATGCG

ATGGCAATGTGGATGCA

ATGGCCCTGTGGATGCG

ATGGCCCCGTGGATGCG

ATGTCCCCGTGGATGCG

ATGGCCCCGTGGAACCG

Time

Page 23: Introduktion til Bioinformatik Hold 01 Oktober 2010.

• Species1: ATGGCAATGTGGATGCA

• Species2: ATGGCCCCGTGGAACCG

• Species3: ATGTCCCCGTGGATGCG

“DNA alignment”

[Figure annotations: 3, 6 and 5, matching the pairwise difference counts between the three sequences]
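
Counting the differences between each pair of aligned sequences gives a simple measure of how far apart they are. The sketch below does this for the three sequences on this slide; it assumes the sequences are already aligned and of equal length, and the names `hamming` and `distances` are chosen for this example.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


species = {
    "Species1": "ATGGCAATGTGGATGCA",
    "Species2": "ATGGCCCCGTGGAACCG",
    "Species3": "ATGTCCCCGTGGATGCG",
}

distances = {
    (name1, name2): hamming(seq1, seq2)
    for (name1, seq1), (name2, seq2) in combinations(species.items(), 2)
}

for pair, count in distances.items():
    print(pair, count)
```

Species2 and Species3 differ at 3 positions, Species1 and Species3 at 5, and Species1 and Species2 at 6, so Species2 and Species3 are the closest pair.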

Page 24: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Real life example: Alignment

• Insulin from 7 different species

• Homo: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAA

• Pan: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGACCCAGCCTCGGCCTTTGTGAA

• Sus: ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCCCCGGCCCAGGCCTTCGTGAA

• Ovis: ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCCCCGGCCCACGCCTTCGTCAA

• Canis: ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCGCCCACCCGAGCCTTCGTTAA

• Mus: ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAACCCACCCAGGCTTTTGTCAA

• Gallus: ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGAACCAGCTATGCAGCTGCCAA

Page 25: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Real life example: Tree

Page 26: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Interpretation of Multiple Alignments

Conserved features assumed to be important for functionality

For instance: conserved pairs of cysteines indicate possible disulphide bridge
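
One simple way to spot conserved features is to mark every alignment column in which all sequences agree. The sketch below does this for the three short example sequences from the “DNA alignment” slide; it assumes the sequences are already aligned to equal length, and `conservation_line` is a hypothetical helper name. Real alignment viewers use richer conservation scores than this all-or-nothing test.

```python
def conservation_line(sequences: list[str]) -> str:
    """Return '*' for columns where all sequences agree, '.' otherwise."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if len(set(column)) == 1 else "."
        for column in zip(*sequences)
    )


alignment = [
    "ATGGCAATGTGGATGCA",
    "ATGGCCCCGTGGAACCG",
    "ATGTCCCCGTGGATGCG",
]

for seq in alignment:
    print(seq)
print(conservation_line(alignment))  # ***.*...*****..*.
```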

Page 27: Introduktion til Bioinformatik Hold 01 Oktober 2010.

• Darwin: all organisms are related through descent with modification

• Prediction: similar molecules have similar functions in different organisms

Protein synthesis carried out by very similar RNA-containing molecular complexes (ribosomes) that are present in all known organisms

Sequences are related

Page 28: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Sequences are related, II

Related oxygen-binding proteins in humans

Page 29: Introduktion til Bioinformatik Hold 01 Oktober 2010.

DNA as Biological Information

Rasmus Wernersson

Page 30: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Overview

• Learning objectives

– About Biological Information

– A note about DNA sequencing techniques and DNA data

– File formats used for biological data

– Introduction to the GenBank database

Page 31: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Information flow in biological systems

Page 32: Introduktion til Bioinformatik Hold 01 Oktober 2010.

DNA sequences = summary of information

5’ AGCC 3’

3’ TCGG 5’

5’ ATGGCCAGGTAA 3’

DNA backbone: http://en.wikipedia.org/wiki/DNA

(Deoxy)ribose: http://en.wikipedia.org/

[Figure: ribose and deoxyribose with carbon atoms numbered 1 to 5; strand ends labeled 5’ and 3’]

Page 33: Introduktion til Bioinformatik Hold 01 Oktober 2010.

PCR

Melting: 96 °C, 30 sec

Annealing: ~55 °C, 30 sec

Extension: 72 °C, 30 sec

35 cycles

Animation: http://depts.washington.edu/~genetics/courses/genet371b-aut99/PCR_contents.html
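
Some rough arithmetic for the cycling program above, as a Python sketch. It assumes the idealized case in which every cycle doubles the number of target molecules; real reactions are less efficient and level off in late cycles. The `efficiency` parameter and both function names are assumptions made for this illustration, and the time estimate ignores the time the machine spends changing temperature.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after PCR; efficiency 1.0 means perfect doubling."""
    return initial_copies * (1 + efficiency) ** cycles


def cycling_time_minutes(cycles: int, step_seconds=(30, 30, 30)) -> float:
    """Total time spent in the melting, annealing and extension steps."""
    return cycles * sum(step_seconds) / 60


print(pcr_copies(1, 35))         # upper bound from a single template molecule
print(cycling_time_minutes(35))  # minutes of pure step time for 35 cycles
```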

Page 34: Introduktion til Bioinformatik Hold 01 Oktober 2010.

PCR

Animation: http://www.people.virginia.edu/~rjh9u/pcranim.html

PCR graph: http://pathmicro.med.sc.edu/pcr/realtime-home.htm

Page 35: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Gel electrophoresis

• DNA fragments are separated using gel electrophoresis

– Typically 1% agarose

– Stained with EtBr or SYBR Green (glows in UV light).

– A DNA “ladder” is used for identification of known DNA lengths.

Gel picture: http://www.pharmaceutical-technology.com/projects/roche/images/roche3.jpg

PCR setup: http://arbl.cvmbs.colostate.edu/hbooks/genetics/biotech/gels/agardna.html

[Figure: gel with the + and - electrodes marked]

Page 36: Introduktion til Bioinformatik Hold 01 Oktober 2010.

The Sanger method of DNA sequencing

Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf

[Figure labels: Terminator; OH; X-ray sequencing gel]

Page 37: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Automated sequencing

• The major breakthrough in sequencing has happened through automation.

• Fluorescent dyes.

• Laser-based scanning.

• Capillary electrophoresis.

• Computer-based base-calling and assembly.

Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf

Page 38: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Handout exercise: ”base-calling”

• Handout: Chromatogram

• Groups of 2-3.

• Tasks:

– Identify “difficult” regions

– Identify “difficult” sequence stretches.

– Try to estimate the best interval to use.
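
The last task can also be phrased as a small computation. The sketch below assumes that each called base comes with a numeric quality score (higher is better) and picks the longest stretch in which every score is at or above a cutoff. The scores, the cutoff of 20 and the name `best_interval` are all hypothetical; the handout itself is a printed chromatogram, not a list of numbers.

```python
def best_interval(qualities: list[int], threshold: int = 20) -> tuple[int, int]:
    """Longest run with all scores >= threshold, as (start, end), end exclusive."""
    best = (0, 0)
    start = None
    # A low-quality sentinel at the end closes a run that reaches the last base.
    for i, q in enumerate(list(qualities) + [threshold - 1]):
        if q >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


# Made-up scores: poor at both ends, with a dip in the middle.
scores = [5, 8, 12, 30, 35, 40, 38, 9, 33, 36, 10, 6]
print(best_interval(scores))  # (3, 7)
```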

Page 39: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Biological data on computers

• The GenBank database

• File formats– FASTA– GenBank

Page 40: Introduktion til Bioinformatik Hold 01 Oktober 2010.

NCBI GenBank

• GenBank is one of the main international DNA databases.

• GenBank is hosted by NCBI: National Center for Biotechnology Information.

• GenBank has existed since 1982.

• The database is public - no restrictions on the use of the data within.

Page 41: Introduktion til Bioinformatik Hold 01 Oktober 2010.

FASTA format

>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCAAGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAGCACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGACGGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCCGGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCCATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTGTCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTCTGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCAGTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACCATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAGTACCGTTAA

(Handout)
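
A FASTA file is simple enough to read with a few lines of code: a line starting with `>` opens a new record and gives its name, and all following lines up to the next `>` belong to that record's sequence. The Python sketch below is a minimal reader written for illustration; `parse_fasta` is a made-up name, and the two records in the example are shortened to the first 40 bases of the handout sequences.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Read FASTA-formatted text into a {header: sequence} dictionary."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header text without the '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {header: "".join(parts) for header, parts in records.items()}


example = """>alpha-D
ATGCTGACCGACTCTGACAA
GAAGCTGGTCCTGCAGGTGT
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCG
"""

for header, sequence in parse_fasta(example).items():
    print(header, len(sequence), sequence[:10])
```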

Page 42: Introduktion til Bioinformatik Hold 01 Oktober 2010.

GenBank format

• Originates from the GenBank database.

• Contains both a DNA sequence and annotation of features (e.g. location of genes).

(handout)

Page 43: Introduktion til Bioinformatik Hold 01 Oktober 2010.

GenBank format - HEADER

LOCUS       CMGLOAD      1185 bp    DNA     linear   VRT 18-APR-2005
DEFINITION  Cairina moschata (duck) gene for alpha-D globin.
ACCESSION   X01831
VERSION     X01831.1  GI:62724
KEYWORDS    alpha-globin; globin.
SOURCE      Cairina moschata (Muscovy duck)
  ORGANISM  Cairina moschata
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Anseriformes; Anatidae; Cairina.
REFERENCE   1  (bases 1 to 1185)
  AUTHORS   Erbil,C. and Niessing,J.
  TITLE     The primary structure of the duck alpha D-globin gene: an unusual
            5' splice junction sequence
  JOURNAL   EMBO J. 2 (8), 1339-1343 (1983)
   PUBMED   10872328
COMMENT     Data kindly reviewed (13-NOV-1985) by J. Niessing.

Page 44: Introduktion til Bioinformatik Hold 01 Oktober 2010.

GenBank format - ORIGIN section

ORIGIN
        1 ctgcgtggcc tcagcccctc cacccctcca cgctgataag ataaggccag ggcgggagcg
       61 cagggtgcta taagagctcg gccccgcggg tgtctccacc acagaaaccc gtcagttgcc
      121 agcctgccac gccgctgccg ccatgctgac cgccgaggac aagaagctca tcgtgcaggt
      181 gtgggagaag gtggctggcc accaggagga attcggaagt gaagctctgc agaggtgtgg
      241 gctgggccca gggggcactc acagggtggg cagcagggag caggagccct gcagcgggtg
      301 tgggctggga cccagagcgc cacggggtgc gggctgagat gggcaaagca gcagggcacc
      361 aaaactgact ggcctcgctc cggcaggatg ttcctcgcct acccccagac caagacctac
      421 ttcccccact tcgacctgca tcccggctct gaacaggtcc gtggccatgg caagaaagtg
      481 gcggctgccc tgggcaatgc cgtgaagagc ctggacaacc tcagccaggc cctgtctgag
      541 ctcagcaacc tgcatgccta caacctgcgt gttgaccctg tcaacttcaa ggcaagcggg
      601 gactagggtc cttgggtctg ggggtctgag ggtgtggggt gcagggtctg ggggtccagg
      661 ggtctgagtt tcctggggtc tggcagtcct gggggctgag ggccagggtc ctgtggtctt
      721 gggtaccagg gtcctggggg ccagcagcca gacagcaggg gctgggattg catctgggat
      781 gtgggccaga ggctgggatt gtgtttggaa tgggagctgg gcaggggcta gggccagggt
      841 gggggactca gggcctcagg gggactcggg gggggactga gggagactca gggccatctg
      901 tccggagcag gggtactaag ccctggtttg ccttgcagct gctggcacag tgcttccagg
      961 tggtgctggc cgcacacctg ggcaaagact acagccccga gatgcatgct gcctttgaca
     1021 agttcttgtc cgccgtggct gccgtgctgg ctgaaaagta cagatgagcc actgcctgca
     1081 cccttgcacc ttcaataaag acaccattac cacagctctg tgtctgtgtg tgctgggact
     1141 gggcatcggg ggtcccaggg agggctgggt tgcttccaca catcc
//
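
In the ORIGIN section each line starts with the position of its first base, followed by blocks of ten bases, and the record ends with `//`. The Python sketch below turns such a block back into one plain sequence string and uses the position numbers as a sanity check. It is an illustration only: `origin_to_sequence` is a made-up name, and the sample contains just the first two lines of the duck alpha-D globin record.

```python
def origin_to_sequence(origin_text: str) -> str:
    """Strip position numbers and spaces from a GenBank ORIGIN block."""
    bases: list[str] = []
    length = 0
    for line in origin_text.splitlines():
        line = line.strip()
        if line == "//":
            break  # end-of-record marker
        if not line or line.startswith("ORIGIN"):
            continue
        tokens = line.split()
        # The first token is the 1-based position of the first base on the line.
        if int(tokens[0]) != length + 1:
            raise ValueError(f"unexpected position {tokens[0]}")
        for block in tokens[1:]:
            bases.append(block)
            length += len(block)
    return "".join(bases)


sample = """ORIGIN
        1 ctgcgtggcc tcagcccctc cacccctcca cgctgataag ataaggccag ggcgggagcg
       61 cagggtgcta taagagctcg gccccgcggg tgtctccacc acagaaaccc gtcagttgcc
//
"""

seq = origin_to_sequence(sample)
print(len(seq))    # 120
print(seq[68:73])  # tataa
```

Python slices are 0-based and end-exclusive, so `seq[68:73]` covers bases 69 to 73, the region annotated as `TATA_signal 69..73` in the FEATURES section of the same record.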

Page 45: Introduktion til Bioinformatik Hold 01 Oktober 2010.

GenBank format - FEATURE section

FEATURES             Location/Qualifiers
     source          1..1185
                     /organism="Cairina moschata"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:8855"
     CAAT_signal     20..24
     TATA_signal     69..73
     precursor_RNA   101..1114
                     /note="primary transcript"
     exon            101..234
                     /number=1
     CDS             join(143..234,387..591,939..1067)
                     /codon_start=1
                     /product="alpha D-globin"
                     /protein_id="CAA25966.2"
                     /db_xref="GI:4455876"
                     /db_xref="GOA:P02003"
                     /db_xref="InterPro:IPR000971"
                     /db_xref="InterPro:IPR002338"
                     /db_xref="InterPro:IPR002340"
                     /db_xref="InterPro:IPR009050"
                     /db_xref="UniProt/Swiss-Prot:P02003"
                     /translation="MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFP
                     HFDLHPGSEQVRGHGKKVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLA
                     QCFQVVLAAHLGKDYSPEMHAAFDKFLSAVAAVLAEKYR"
     repeat_region   227..246
                     /note="direct repeat 1"
     intron          235..386
                     /number=1
     repeat_region   289..309
                     /note="direct repeat 1"
     exon            387..591
                     /number=2
     intron          592..939
                     /number=2
     exon            940..1114
                     /number=3
     polyA_signal    1095..1100
     polyA_signal    1114
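
The CDS location `join(143..234,387..591,939..1067)` says that the coding sequence is pieced together from three stretches of the DNA, with the introns left out. The Python sketch below reads such a location string and joins the corresponding pieces. It is a simplified illustration: `parse_join`, `spliced_length` and `splice` are made-up names, and only plain forward-strand locations are handled (no `complement(...)` and no `<` or `>` partial markers).

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Extract 1-based, inclusive (start, end) pairs from 'a..b' or 'join(...)'."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments: list[tuple[int, int]]) -> int:
    """Total number of bases covered by the segments."""
    return sum(end - start + 1 for start, end in segments)


def splice(sequence: str, segments: list[tuple[int, int]]) -> str:
    """Concatenate the segments; GenBank positions are 1-based and inclusive."""
    return "".join(sequence[start - 1:end] for start, end in segments)


segments = parse_join("join(143..234,387..591,939..1067)")
print(segments)                       # [(143, 234), (387, 591), (939, 1067)]
print(spliced_length(segments))       # 426
print(spliced_length(segments) // 3)  # 142 codons

# Toy demonstration of splicing on a short made-up sequence:
print(splice("AAACCCGGGTTT", [(1, 3), (10, 12)]))  # AAATTT
```

The 426 bases correspond to 142 codons, which fits the 141 amino acids listed under `/translation` plus one stop codon.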

Page 46: Introduktion til Bioinformatik Hold 01 Oktober 2010.

Exercise: GenBank

• Work in groups of 2-3 people.

• The exercise guide is linked from the course programme.

• Read the guide carefully - it contains a lot of information about GenBank.