Introduction to Bioinformatics

125
V. K. Singh Information officer Centre for Bioinformatics Banaras Hindu University ntroduction to Bioinformatic

description

Elements of Bioinformatics

Transcript of Introduction to Bioinformatics

Page 1: Introduction to Bioinformatics

V. K. SinghInformation officerCentre for BioinformaticsBanaras Hindu University

Introduction to Bioinformatics

Page 2: Introduction to Bioinformatics

What is Bioinformatics

“The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research”

www.niehs.nih.gov/dert/trc/glossary.htm

1: Introduction

Page 3: Introduction to Bioinformatics

What Bioinformatics can offer to biologists?

1: Introduction

Page 4: Introduction to Bioinformatics

1: Introduction

Computational biology – Insilico genome revolution at the turn of the century.

Page 5: Introduction to Bioinformatics

•Life was classified as

plants and animals

•When Bacteria were discoveredthey were initially classified as plants.

•Ernst Haeckel (1866) placed all unicellular organisms in a kingdom called Protista, separated from Plantae and Animalia.

In the very beginning

1: Introduction

Page 6: Introduction to Bioinformatics

1: Introduction

Page 7: Introduction to Bioinformatics

Thus, life were classified to 5 kingdoms:

When electron microscopes were developed, it was found that Protista in fact include both cells with and without nucleus. Also, fungi were found to differ from plants, since they are heterotrophs (they do not synthesize their food).

LIFE

FungiPlants Animals ProtistsProcaryotes

1: Introduction

Page 8: Introduction to Bioinformatics

Later, plants, animals, protists and fungi were collectively called the Eucarya domain, and the procaryotes were shifted from a kingdom to be a Bacteria domain.

Domains EucaryaBacteria

FungiPlants Animals ProtistsKingdoms

Even later, a new Domain was discovered…

1: Introduction

Page 9: Introduction to Bioinformatics

rRNA was sequenced from a great number of organisms to study phylogeny

1: Introduction

Page 10: Introduction to Bioinformatics

Revolutionizing the Classification of Life

1: Introduction

The rRNA phylogenetic tree

Page 11: Introduction to Bioinformatics

From sequence analysis only, it was thus established that life is divided into 3:BacteriaArchaeaEucarya

1: Introduction

Page 12: Introduction to Bioinformatics

Gregor Mendellaws of inheritance,“gene”1866

Watson and Crick

DNA Discovery 1953

Genome

Project 2003

1: Introduction

Page 13: Introduction to Bioinformatics

Sequencing of Genomes

Page 14: Introduction to Bioinformatics
Page 15: Introduction to Bioinformatics
Page 16: Introduction to Bioinformatics

Genomic Sequencing – shotgun sequencing

Sequencing is usually ~700 bp in a single run.

How can we sequence a genome?

1: Introduction

Page 17: Introduction to Bioinformatics

Genomic Sequencing – Walking.

1.Design a primer2.Sequence.3.Design a new primer4.Sequence5.…

One has to design new primers every time. To do so, one has to wait for the sequencing results

1: Introduction

Page 18: Introduction to Bioinformatics

GAGGAGACGAACACCCGTATACAGTCGACG

ACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATA

GTATACAGTCGACGTTTATATATA

ACCCCGAGGAGACGA

Genomic Sequencing – shotgun sequencing

1. Break DNA to small pieces2. Sequence each piece3. Assemble

1: Introduction

Page 19: Introduction to Bioinformatics

GAGGAGACGAACACCCGTATACAGTCGACG

ACCCCGAGGAGACGA ? GTATACAGTCGACGTTTATATATA

GTATACAGTCGACGTTTATATATA

ACCCCGAGGAGACGA

Shotgun sequencing – why isn’t it a trivial task?

1. By chance, some parts are not sequenced even once!!!

1: Introduction

Page 20: Introduction to Bioinformatics

Shotgun sequencing – why isn’t it a trivial task?

2. Some pieces do not align because of sequencing errors

GAGGTGAGGAACACCCGTATACAGTCGACG

ACCCCGAGG?GA?GAACACCCGTATACAGTCGACGTTTATATATA

ACCCCGAGGAGACGA

1: Introduction

Page 21: Introduction to Bioinformatics

Shotgun sequencing – why not a trivial task?

3. Repetitive sequences –satellites DNA.

GGGGGGGGGGGGGGGGGGGGGGGGGGGG

ACCCCGGGGGGGGGGGGG????GGGGGGGGGGGGGA

GGGGGGGGGGGGGGGGGGGGGGA

ACCCCGGGGG

1: Introduction

Page 22: Introduction to Bioinformatics

A section of the genome that could be reliably assembled.

A contig

1: Introduction

Page 23: Introduction to Bioinformatics

23

BIOINFORMATICS DATABASES

Page 24: Introduction to Bioinformatics

24

What’s in a database?• Sequences – genes, proteins, etc…• Full genomes• Expression data• Structures• Annotation – information about genes/proteins:

- function- cellular location- chromosomal location- introns/exons- phenotypes, diseases

• Publications

Page 25: Introduction to Bioinformatics

25

NCBI and Entrez

• One of the most largest and comprehensive databases belonging to the NIH (national institute of health. The primary Federal agency for conducting and supporting medical research in the USA)

• Entrez is the search engine of NCBI• Search for :

genes, proteins, genomes, structures, diseases, publications, and more

http://www.ncbi.nlm.nih.gov

Page 26: Introduction to Bioinformatics
Page 27: Introduction to Bioinformatics
Page 28: Introduction to Bioinformatics
Page 29: Introduction to Bioinformatics
Page 30: Introduction to Bioinformatics
Page 31: Introduction to Bioinformatics
Page 32: Introduction to Bioinformatics

32

PubMed: NCBI’s database of biomedical articles

Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells “, J Virol. 2006 May;80(9):4388-95.during virus entry into host cells “, J Virol. 2006 May;80(9):4388-95.

Page 33: Introduction to Bioinformatics

33

Use fields!Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]

For the full list of field tags: go to help -> Search Field Descriptions and Tags

Page 34: Introduction to Bioinformatics

34

Example

• Retrieve all publications in which the first author is: Davidovich C and the last author is: Yonath A

Page 35: Introduction to Bioinformatics

35

Using limits

Retrieve the publications of Yonath A, in the journals: Nature and Proc Natl Acad Sci U S A., in the last 5 years

Page 36: Introduction to Bioinformatics

36

Searching NCBI for the protein human CD4

Search demonstrationSearch demonstration

Page 37: Introduction to Bioinformatics

37

Page 38: Introduction to Bioinformatics

38

Using field descriptions, qualifiers, and boolean operators

• Cd4[GENE] AND human[ORGN] Or Cd4[gene name] AND human[organism]

• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

– Boolean Operators:ANDORNOT

Note: do not use the field Protein name [PROT], only GENE!

Page 39: Introduction to Bioinformatics

39

This time we directly search in the protein databaseThis time we directly search in the protein database

Page 40: Introduction to Bioinformatics

40

RefSeq• Subcollection of NCBI databases with only non-

redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

Page 41: Introduction to Bioinformatics

41

Page 42: Introduction to Bioinformatics

42An explanation on GenBank records

Page 43: Introduction to Bioinformatics

4343

Fasta format

> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens] MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

Save accession numbers for future use (makes searching quicker):RefSeq accession number: NP_000607.1

header

ID/accession description

sequence

Page 44: Introduction to Bioinformatics

4444

Downloading

Page 45: Introduction to Bioinformatics

Homology Search Using

Sequence Alignment

Page 46: Introduction to Bioinformatics

|| || ||||| ||| || || |||||||||||||||||||MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

Before we begin…

Page 47: Introduction to Bioinformatics

What is sequence alignment?

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE|| || ||||| ||| || || ||MVHLTPEEKTAVNALWGKVNVDAVGGE

Page 48: Introduction to Bioinformatics

Why sequence alignment?

Predict characteristics of a protein – use the structure or function information on known proteins with similar sequences available in databases in order to predict the structure or function of an unknown protein

Assumptions: similar sequences produce similar proteins

Page 49: Introduction to Bioinformatics

Local vs. Global• Global alignment – finds the best

alignment across the whole two sequences.

• Local alignment – finds regions of high similarity in parts of the sequences.

ADLGAVFALCDRYFQ|||| |||| |ADLGRTQN-CDRYYQ

ADLG CDRYFQ|||| |||| |ADLG CDRYYQ

Global alignment:

forces alignment in

regions which differ

Local alignment

concentrates on regions of high similarity

Page 50: Introduction to Bioinformatics

In the course of evolution, the sequences changed from the ancestral sequence by random mutations

Three types of changes:1. Insertion - an insertion of a letter or several letters to the sequence.

AAGA AAGTA

Sequence evolution

AAGAAGAA

InsertionInsertion

Page 51: Introduction to Bioinformatics

In the course of evolution, the sequences changed from the ancestral sequence by random mutations

Three types of changes :1. Insertion - an insertion of a letter or several letters to the sequence.

AAGA AAGTA2. Deletion – a deletion of a letter (or more) from the sequence.

AAGA AGA

Sequence evolution

AA AGAG

DeletionDeletion

AA

Page 52: Introduction to Bioinformatics

In the course of evolution, the sequences changed from the ancestral sequence by random mutations

Three types of mutations:1. Insertion - an insertion of a letter or several letters to the sequence.

AAGA AAGTA2. Deletion - deleting a letter (or more) from the sequence.

AAGA AGA3. Substitution – a replacement of one (or more) sequence letter by

another AAGA AACA

Evolutionary changes in sequences

AAAA AA

SubstitutionSubstitution

GGCCInsertionInsertion + + DeletionDeletion IndelIndel

Page 53: Introduction to Bioinformatics

Sequence alignment

AAGCTGAATTCGAAAGGCTCATTTCTGA

AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-

One possible alignment:

This alignment includes:

2 mismatches 4 indels (gap)

10 perfect matches

Page 54: Introduction to Bioinformatics

Choosing an alignment:

• Many different alignments are possible:

AAGCTGAATTCGAAAGGCTCATTTCTGA

A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-

Which alignment is better?

AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-

Page 55: Introduction to Bioinformatics

Scoring an alignment:example - naïve scoring system:

• Match: +1• Mismatch: -2• Indel: -1

AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-

Score: = (+1)x10 + (-2)x2 + (-1)x4 = 2 Score: = (+1)x9 + (-2)x2 + (-1)x6 = -1

A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-

Higher score Better alignment

Page 56: Introduction to Bioinformatics

Scoring system:

• Different scoring systems can produce different optimal alignments

• Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, physico-chemical properties based – Some mismatches are more plausible

• Transition vs. Transversion • LysArg ≠ LysCys

– Gap extension Vs. Gap opening

Page 57: Introduction to Bioinformatics

Substitutions Matrices

• Nucleic acids:– Transition-transversion

• Amino acids:– Evolution (empirical data) based: (PAM, BLOSUM)– Physico-chemical properties based (Grantham,

McLachlan)

Page 58: Introduction to Bioinformatics

Web server for pairwise alignment

Page 59: Introduction to Bioinformatics

BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment

• Does not use an exact algorithm but a heuristic

Page 60: Introduction to Bioinformatics

Back to NCBI

Page 61: Introduction to Bioinformatics

BLAST – bl2seq

Page 62: Introduction to Bioinformatics

blastnblastn – nucleotide – nucleotide

blastpblastp – protein – protein

Bl2Seq - query

Page 63: Introduction to Bioinformatics

Bl2seq results

Page 64: Introduction to Bioinformatics

Bl2seq results

MatchMatch DissimilarityDissimilarity GapsGaps SimilaritySimilarity Low Low

complexitycomplexity

Page 65: Introduction to Bioinformatics

Bl2seq results:

• Bits score – A score for the alignment according to the number of similarities, identities, etc.

• Expected-score (E-value) –The number of alignments with the same score one can “expect” to see by chance when searching a database of a particular size. The closer the e-value approaches zero, the greater the confidence that the hit is real

Page 66: Introduction to Bioinformatics

BLAST – programs

Query: DNA Protein

Database: DNA Protein

Page 67: Introduction to Bioinformatics

BLAST – Blastp

Page 68: Introduction to Bioinformatics

Blastp - results

Page 69: Introduction to Bioinformatics

Blastp – results (cont’)

Page 70: Introduction to Bioinformatics

Blastp – acquiring sequences

Page 71: Introduction to Bioinformatics

blastp – acquiring sequences (cont’)

Page 72: Introduction to Bioinformatics

Multiple Sequence Alignment (MSA)

andPhylogeny

Page 73: Introduction to Bioinformatics

One of the options to get multiple sequence Fasta file

Page 74: Introduction to Bioinformatics

One of the options to get multiple sequence Fasta file

Page 75: Introduction to Bioinformatics

Input: multiple sequence Fasta file>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS

>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS

>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL. . .

Page 76: Introduction to Bioinformatics

Input: multiple sequence Fasta file>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS

>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS

>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL. . .

Page 77: Introduction to Bioinformatics

Step1: Load the sequences

Page 78: Introduction to Bioinformatics

Sequences and conservation view

Page 79: Introduction to Bioinformatics

Step2: Perform Alignment

Page 80: Introduction to Bioinformatics

Sequences and conservation view

Page 81: Introduction to Bioinformatics

Sequences and conservation view

Page 82: Introduction to Bioinformatics

Step 3: Create tree

Page 83: Introduction to Bioinformatics

Step 4: NJPlot

Page 84: Introduction to Bioinformatics

• We need some statistical way to estimate the confidence in the tree topology

• But we don’t know anything about the tree topology distribution or parameters

• The only data source we have is our data (MSA)

• So, we must rely on our own resources: “pull up by your own bootstraps”

How robust is our tree?

Page 85: Introduction to Bioinformatics

Bootstrap

1. Resample K positions n times

12345 K1 : ATCTG…A 2 : ATCTG…C3 : ACTTA…C N : ACCTA…T

11244 K1 : AATTT…T2 : AATTT…G3 : AACTT…TN : AACTT…T

47789…K1 : TTTAT…T2 : TAACC…G3 : TAACC…TN : TGGGA…T

15578… K1 : AGGTA…T2 : AGGAC…G3 : AAAAC…AN : AAAGG…C

Page 86: Introduction to Bioinformatics

Bootstrap2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree

Sp1Sp2

Sp3Sp4

Sp1Sp2

Sp3Sp4

Sp1Sp2

Sp3Sp4

11244 K1 : AATTT…T2 : AATTT…G3 : AACTT…TN : AACTT…T

47789…K1 : TTTAT…T2 : TAACC…G3 : TAACC…TN : TGGGA…T

15578… K1 : AGGTA…T2 : AGGAC…G3 : AAAAC…AN : AAAGG…C

Page 87: Introduction to Bioinformatics

Bootstrap3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis

Sp1Sp2

Sp3Sp4

Sp1Sp2

Sp3Sp4

Sp1Sp2

Sp3Sp4

Sp1Sp2

Sp3

Sp4

67%100%

Page 88: Introduction to Bioinformatics

Step 3.5 - Bootstrap

Page 89: Introduction to Bioinformatics

Bootstrap values on NJPlot

Note:ClustalX saves trees as .ph filetrees with bootstrap are saved as .phb

You might have to reopen the tree…

Page 90: Introduction to Bioinformatics

Protein information Resource

• Swissprot• PDB

Page 91: Introduction to Bioinformatics

91

Swissprot

• A protein sequence database which strives to provide a high level of annotation regarding:* the function of a protein* domains structure* post-translational modifications* variants

• One entry for each protein

http://www.expasy.ch/sprot

Page 92: Introduction to Bioinformatics

92

Page 93: Introduction to Bioinformatics

93

PDB: Protein Data Bank

• Main database of 3D structures of macromolecules

• Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)

• Is highly redundant

http://www.rcsb.org

Page 94: Introduction to Bioinformatics

94

Human CD4 in complex with HIV gp120

gp120

CD4

PDB ID 1G9M

Page 95: Introduction to Bioinformatics

What do bioinformaticians study?

• Bioinformatics today is part of almost every molecular biological research.

• Just a few examples…

1: Introduction

Page 96: Introduction to Bioinformatics

Example 1

• Compare proteins with similar sequences (for instance –kinases) and understand what the similarities and differences mean

1: Introduction

Page 97: Introduction to Bioinformatics

Example 2

• Look at the genome and predict where genes are (promoters; transcription binding sites; introns; exons)

1: Introduction

Page 98: Introduction to Bioinformatics

• Predict the 3-dimensional structure of a protein from its primary sequence

Example 3

Ab-initio prediction – extremely difficult!

1: Introduction

Page 99: Introduction to Bioinformatics

• Correlate between gene expression and disease

Example 4

A gene chip – quantifying gene expression in different tissues under different conditions

May be used for personalized medicine

1: Introduction

Page 100: Introduction to Bioinformatics

Role of Centre for Bioinformatics in School of Biotechnology, BHU

Page 101: Introduction to Bioinformatics

MAI N

©1996-2007 All Rights Reserved. Online J ournal of Bioinformatics . You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be, reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.

OJBTM

Online Journal of Bioinformatics©

8 (1) : 75-83, 2007

In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different

Cultivars of Wheat, Rice and Oat

Yadav D1, Singh VK1, Singh NK2

1Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities G.B. pant University of Agriculture and Technology, Pantnagar (Uttarakhand) 2National Research Center on Plant Biotechnology Indian Agriculture Research Institute, New Delhi 110012

ABSTRACT

A total of 24 promoter sequences with assigned accession number EF393165 to EF393188 and representing major seed storage proteins of wheat namely High molecular weight glutenin subunit (HMW-GS), low molecular weight glutenin subunits (LMW-GS) alpha/beta gliadins, triticin along with rice glutelins and oat 12S globulins were cloned from indigenous cultivars of wheat, rice and oat and was subjected to in silico analysis using bioinformatic softwares for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing major group of seed storage promoters. The

presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE, MBS along with the common motifs present in seed storage promoters like Prolamin-box, TATA, CAAT provides a better option for multifarious uses. Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.

Seed Storage Protein Promoters

Accession Number

Cultivars Length(bp)

HMW Glutenin(Triticum aestivum)

EF396165EF396184EF396166EF396167EF396168EF396169EF396170EF396171EF396172EF396173

UP-262UP-262UP-262UP-262UP-262UP-301UP-301UP-301UP-301UP-301

402487397412385385393398392393

LMW Glutenin(Triticum aestivum)

EF396187 HD-2329 551

/ gliadin (Triticum aestivum)

EF396174EF396175EF396177EF396178EF396176EF396182

KalyansonaKalyansonaUP-262UP-262UP-262UP-301

520564591521563548

Triticin (Triticum aestivum)

EF396181EF396183EF396185EF396186

HD-2329HD-2329HD-2329Kalyansona

428370452343

12S Globulin ( Avena sativa)

EF396179 UPO-94 549

Glutelins ( Oryza sativa)

EF396180EF396188

PantDhan-12Pusa Basmati

562487

Page 102: Introduction to Bioinformatics
Page 103: Introduction to Bioinformatics
Page 104: Introduction to Bioinformatics

200 bp

172 bp

Motif-1Motif-2 Motif-3

CCC CZinc- finger

AAY28423 Piceaabies

EAY88711 Oryzasativaindica

ABI16029 Glycinemax

XP 001751505 Physcomitrella

CAC85949 Hordeumvulgare

EAY73401 Oryzasativaindica

CAO15000 Vitisvinifera

ABN08462 Medicagotruncatula

NP 001042660 Oryzasativajaponi

XP 001759349 Physcomitrellapat

ACF80167 Zeamays

EAZ41131 Oryzasativa

ACF06723 Paspalumscrobiculatum

ACC59765 Eleusinecoracana

CAN79859 Vitisvinifera

NP 001060673 Oryzasativajaponi

ACF06725 Echinochloafrumentace

EAZ05181 Oryzasativaindica

ACF06718 Sorghumbicolor

CAO64539 Vitisvinifera

ACF06722 Panicumantidotale

ABQ42348 Glycinemax

ACF06719 Hordeumvulgare

ACF06726 Triticumaestivum

ACF06720 Avenasativa

ACF81642 Zeamays

ACF06721 Panicummilliaceum

CAA04440 Hordeumvulgare

CAA09976 Triticumaestivum

ACF06724 Setariaitalica

ACC59766 Oryzasativa

89

58

63

63

68

55

50

48

0.00.20.40.60.81.0

Page 105: Introduction to Bioinformatics
Page 106: Introduction to Bioinformatics

MAI N

©1996-2007 All Rights Reserved. Online J ournal of Bioinformatics . You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be, reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.

OJBTM

Online Journal of Bioinformatics©

8 (1) : 75-83, 2007

In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different

Cultivars of Wheat, Rice and Oat

Yadav D1, Singh VK1, Singh NK2

1Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities G.B. pant University of Agriculture and Technology, Pantnagar (Uttarakhand) 2National Research Center on Plant Biotechnology Indian Agriculture Research Institute, New Delhi 110012

ABSTRACT

A total of 24 promoter sequences with assigned accession number EF393165 to EF393188 and representing major seed storage proteins of wheat namely High molecular weight glutenin subunit (HMW-GS), low molecular weight glutenin subunits (LMW-GS) alpha/beta gliadins, triticin along with rice glutelins and oat 12S globulins were cloned from indigenous cultivars of wheat, rice and oat and was subjected to in silico analysis using bioinformatic softwares for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing major group of seed storage promoters. The

presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE, MBS along with the common motifs present in seed storage promoters like Prolamin-box, TATA, CAAT provides a better option for multifarious uses. Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.

Seed Storage Protein Promoters

Accession Number

Cultivars Length

)bp(

HMW Glutenin)Triticum

aestivum(

EF396165EF396184EF396166EF396167EF396168EF396169EF396170EF396171EF396172EF396173

UP-262UP-262UP-262UP-262UP-262UP-301UP-301UP-301UP-301UP-301

402487397412385385393398392393

LMW Glutenin)Triticum

aestivum(

EF396187 HD-2329 551

/ gliadin (Triticum aestivum)

EF396174EF396175EF396177EF396178EF396176EF396182

KalyansonaKalyansona

UP-262UP-262UP-262UP-301

520564591521563548

Triticin (Triticum aestivum)

EF396181EF396183EF396185EF396186

HD-2329HD-2329HD-2329

Kalyansona

428370452343

12S Globulin ( Avena sativa)

EF396179 UPO-94 549

Glutelins ( Oryza sativa)

EF396180EF396188

PantDhan-12Pusa Basmati

562487

Page 107: Introduction to Bioinformatics
Page 108: Introduction to Bioinformatics

http://www.insilicogenomics.in/cry-bt-search.asp

Page 109: Introduction to Bioinformatics
Page 110: Introduction to Bioinformatics
Page 111: Introduction to Bioinformatics
Page 112: Introduction to Bioinformatics
Page 113: Introduction to Bioinformatics
Page 114: Introduction to Bioinformatics

CERCOSPORA LEAF SPOT DISEASE OF PIGEONPEA AND ITS MANAGEMENT

Page 115: Introduction to Bioinformatics
Page 116: Introduction to Bioinformatics
Page 117: Introduction to Bioinformatics
Page 118: Introduction to Bioinformatics
Page 119: Introduction to Bioinformatics
Page 120: Introduction to Bioinformatics
Page 121: Introduction to Bioinformatics
Page 122: Introduction to Bioinformatics
Page 123: Introduction to Bioinformatics
Page 124: Introduction to Bioinformatics
Page 125: Introduction to Bioinformatics