Introduction to Bioinformatics
-
Upload
vinay-singh -
Category
Education
-
view
781 -
download
4
description
Transcript of Introduction to Bioinformatics
V. K. SinghInformation officerCentre for BioinformaticsBanaras Hindu University
Introduction to Bioinformatics
What is Bioinformatics
“The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research”
www.niehs.nih.gov/dert/trc/glossary.htm
1: Introduction
What Bioinformatics can offer to biologists?
1: Introduction
1: Introduction
Computational biology – Insilico genome revolution at the turn of the century.
•Life was classified as
plants and animals
•When Bacteria were discoveredthey were initially classified as plants.
•Ernst Haeckel (1866) placed all unicellular organisms in a kingdom called Protista, separated from Plantae and Animalia.
In the very beginning
1: Introduction
1: Introduction
Thus, life were classified to 5 kingdoms:
When electron microscopes were developed, it was found that Protista in fact include both cells with and without nucleus. Also, fungi were found to differ from plants, since they are heterotrophs (they do not synthesize their food).
LIFE
FungiPlants Animals ProtistsProcaryotes
1: Introduction
Later, plants, animals, protists and fungi were collectively called the Eucarya domain, and the procaryotes were shifted from a kingdom to be a Bacteria domain.
Domains EucaryaBacteria
FungiPlants Animals ProtistsKingdoms
Even later, a new Domain was discovered…
1: Introduction
rRNA was sequenced from a great number of organisms to study phylogeny
1: Introduction
Revolutionizing the Classification of Life
1: Introduction
The rRNA phylogenetic tree
From sequence analysis only, it was thus established that life is divided into 3:BacteriaArchaeaEucarya
1: Introduction
Gregor Mendellaws of inheritance,“gene”1866
Watson and Crick
DNA Discovery 1953
Genome
Project 2003
1: Introduction
Sequencing of Genomes
Genomic Sequencing – shotgun sequencing
Sequencing is usually ~700 bp in a single run.
How can we sequence a genome?
1: Introduction
Genomic Sequencing – Walking.
1.Design a primer2.Sequence.3.Design a new primer4.Sequence5.…
One has to design new primers every time. To do so, one has to wait for the sequencing results
1: Introduction
GAGGAGACGAACACCCGTATACAGTCGACG
ACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATA
GTATACAGTCGACGTTTATATATA
ACCCCGAGGAGACGA
Genomic Sequencing – shotgun sequencing
1. Break DNA to small pieces2. Sequence each piece3. Assemble
1: Introduction
GAGGAGACGAACACCCGTATACAGTCGACG
ACCCCGAGGAGACGA ? GTATACAGTCGACGTTTATATATA
GTATACAGTCGACGTTTATATATA
ACCCCGAGGAGACGA
Shotgun sequencing – why isn’t it a trivial task?
1. By chance, some parts are not sequenced even once!!!
1: Introduction
Shotgun sequencing – why isn’t it a trivial task?
2. Some pieces do not align because of sequencing errors
GAGGTGAGGAACACCCGTATACAGTCGACG
ACCCCGAGG?GA?GAACACCCGTATACAGTCGACGTTTATATATA
ACCCCGAGGAGACGA
1: Introduction
Shotgun sequencing – why not a trivial task?
3. Repetitive sequences –satellites DNA.
GGGGGGGGGGGGGGGGGGGGGGGGGGGG
ACCCCGGGGGGGGGGGGG????GGGGGGGGGGGGGA
GGGGGGGGGGGGGGGGGGGGGGA
ACCCCGGGGG
1: Introduction
A section of the genome that could be reliably assembled.
A contig
1: Introduction
23
BIOINFORMATICS DATABASES
24
What’s in a database?• Sequences – genes, proteins, etc…• Full genomes• Expression data• Structures• Annotation – information about genes/proteins:
- function- cellular location- chromosomal location- introns/exons- phenotypes, diseases
• Publications
25
NCBI and Entrez
• One of the most largest and comprehensive databases belonging to the NIH (national institute of health. The primary Federal agency for conducting and supporting medical research in the USA)
• Entrez is the search engine of NCBI• Search for :
genes, proteins, genomes, structures, diseases, publications, and more
http://www.ncbi.nlm.nih.gov
32
PubMed: NCBI’s database of biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells “, J Virol. 2006 May;80(9):4388-95.during virus entry into host cells “, J Virol. 2006 May;80(9):4388-95.
33
Use fields!Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags
34
Example
• Retrieve all publications in which the first author is: Davidovich C and the last author is: Yonath A
35
Using limits
Retrieve the publications of Yonath A, in the journals: Nature and Proc Natl Acad Sci U S A., in the last 5 years
36
Searching NCBI for the protein human CD4
Search demonstrationSearch demonstration
37
38
Using field descriptions, qualifiers, and boolean operators
• Cd4[GENE] AND human[ORGN] Or Cd4[gene name] AND human[organism]
• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
– Boolean Operators:ANDORNOT
Note: do not use the field Protein name [PROT], only GENE!
39
This time we directly search in the protein databaseThis time we directly search in the protein database
40
RefSeq• Subcollection of NCBI databases with only non-
redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
41
42An explanation on GenBank records
4343
Fasta format
> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens] MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker):RefSeq accession number: NP_000607.1
header
ID/accession description
sequence
4444
Downloading
Homology Search Using
Sequence Alignment
|| || ||||| ||| || || |||||||||||||||||||MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
Before we begin…
What is sequence alignment?
Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE|| || ||||| ||| || || ||MVHLTPEEKTAVNALWGKVNVDAVGGE
Why sequence alignment?
Predict characteristics of a protein – use the structure or function information on known proteins with similar sequences available in databases in order to predict the structure or function of an unknown protein
Assumptions: similar sequences produce similar proteins
Local vs. Global• Global alignment – finds the best
alignment across the whole two sequences.
• Local alignment – finds regions of high similarity in parts of the sequences.
ADLGAVFALCDRYFQ|||| |||| |ADLGRTQN-CDRYYQ
ADLG CDRYFQ|||| |||| |ADLG CDRYYQ
Global alignment:
forces alignment in
regions which differ
Local alignment
concentrates on regions of high similarity
In the course of evolution, the sequences changed from the ancestral sequence by random mutations
Three types of changes:1. Insertion - an insertion of a letter or several letters to the sequence.
AAGA AAGTA
Sequence evolution
AAGAAGAA
InsertionInsertion
In the course of evolution, the sequences changed from the ancestral sequence by random mutations
Three types of changes :1. Insertion - an insertion of a letter or several letters to the sequence.
AAGA AAGTA2. Deletion – a deletion of a letter (or more) from the sequence.
AAGA AGA
Sequence evolution
AA AGAG
DeletionDeletion
AA
In the course of evolution, the sequences changed from the ancestral sequence by random mutations
Three types of mutations:1. Insertion - an insertion of a letter or several letters to the sequence.
AAGA AAGTA2. Deletion - deleting a letter (or more) from the sequence.
AAGA AGA3. Substitution – a replacement of one (or more) sequence letter by
another AAGA AACA
Evolutionary changes in sequences
AAAA AA
SubstitutionSubstitution
GGCCInsertionInsertion + + DeletionDeletion IndelIndel
Sequence alignment
AAGCTGAATTCGAAAGGCTCATTTCTGA
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
One possible alignment:
This alignment includes:
2 mismatches 4 indels (gap)
10 perfect matches
Choosing an alignment:
• Many different alignments are possible:
AAGCTGAATTCGAAAGGCTCATTTCTGA
A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-
Which alignment is better?
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
Scoring an alignment:example - naïve scoring system:
• Match: +1• Mismatch: -2• Indel: -1
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
Score: = (+1)x10 + (-2)x2 + (-1)x4 = 2 Score: = (+1)x9 + (-2)x2 + (-1)x6 = -1
A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-
Higher score Better alignment
Scoring system:
• Different scoring systems can produce different optimal alignments
• Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, physico-chemical properties based – Some mismatches are more plausible
• Transition vs. Transversion • LysArg ≠ LysCys
– Gap extension Vs. Gap opening
Substitutions Matrices
• Nucleic acids:– Transition-transversion
• Amino acids:– Evolution (empirical data) based: (PAM, BLOSUM)– Physico-chemical properties based (Grantham,
McLachlan)
Web server for pairwise alignment
BLAST 2 sequences (bl2Seq) at NCBI
Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment
• Does not use an exact algorithm but a heuristic
Back to NCBI
BLAST – bl2seq
blastnblastn – nucleotide – nucleotide
blastpblastp – protein – protein
Bl2Seq - query
Bl2seq results
Bl2seq results
MatchMatch DissimilarityDissimilarity GapsGaps SimilaritySimilarity Low Low
complexitycomplexity
Bl2seq results:
• Bits score – A score for the alignment according to the number of similarities, identities, etc.
• Expected-score (E-value) –The number of alignments with the same score one can “expect” to see by chance when searching a database of a particular size. The closer the e-value approaches zero, the greater the confidence that the hit is real
BLAST – programs
Query: DNA Protein
Database: DNA Protein
BLAST – Blastp
Blastp - results
Blastp – results (cont’)
Blastp – acquiring sequences
blastp – acquiring sequences (cont’)
Multiple Sequence Alignment (MSA)
andPhylogeny
One of the options to get multiple sequence Fasta file
One of the options to get multiple sequence Fasta file
Input: multiple sequence Fasta file>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL. . .
Input: multiple sequence Fasta file>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL. . .
Step1: Load the sequences
Sequences and conservation view
Step2: Perform Alignment
Sequences and conservation view
Sequences and conservation view
Step 3: Create tree
Step 4: NJPlot
• We need some statistical way to estimate the confidence in the tree topology
• But we don’t know anything about the tree topology distribution or parameters
• The only data source we have is our data (MSA)
• So, we must rely on our own resources: “pull up by your own bootstraps”
How robust is our tree?
Bootstrap
1. Resample K positions n times
12345 K1 : ATCTG…A 2 : ATCTG…C3 : ACTTA…C N : ACCTA…T
11244 K1 : AATTT…T2 : AATTT…G3 : AACTT…TN : AACTT…T
47789…K1 : TTTAT…T2 : TAACC…G3 : TAACC…TN : TGGGA…T
15578… K1 : AGGTA…T2 : AGGAC…G3 : AAAAC…AN : AAAGG…C
Bootstrap2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree
Sp1Sp2
Sp3Sp4
Sp1Sp2
Sp3Sp4
Sp1Sp2
Sp3Sp4
11244 K1 : AATTT…T2 : AATTT…G3 : AACTT…TN : AACTT…T
47789…K1 : TTTAT…T2 : TAACC…G3 : TAACC…TN : TGGGA…T
15578… K1 : AGGTA…T2 : AGGAC…G3 : AAAAC…AN : AAAGG…C
Bootstrap3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis
Sp1Sp2
Sp3Sp4
Sp1Sp2
Sp3Sp4
Sp1Sp2
Sp3Sp4
Sp1Sp2
Sp3
Sp4
67%100%
Step 3.5 - Bootstrap
Bootstrap values on NJPlot
Note:ClustalX saves trees as .ph filetrees with bootstrap are saved as .phb
You might have to reopen the tree…
Protein information Resource
• Swissprot• PDB
91
Swissprot
• A protein sequence database which strives to provide a high level of annotation regarding:* the function of a protein* domains structure* post-translational modifications* variants
• One entry for each protein
http://www.expasy.ch/sprot
92
93
PDB: Protein Data Bank
• Main database of 3D structures of macromolecules
• Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)
• Is highly redundant
http://www.rcsb.org
94
Human CD4 in complex with HIV gp120
gp120
CD4
PDB ID 1G9M
What do bioinformaticians study?
• Bioinformatics today is part of almost every molecular biological research.
• Just a few examples…
1: Introduction
Example 1
• Compare proteins with similar sequences (for instance –kinases) and understand what the similarities and differences mean
1: Introduction
Example 2
• Look at the genome and predict where genes are (promoters; transcription binding sites; introns; exons)
1: Introduction
• Predict the 3-dimensional structure of a protein from its primary sequence
Example 3
Ab-initio prediction – extremely difficult!
1: Introduction
• Correlate between gene expression and disease
Example 4
A gene chip – quantifying gene expression in different tissues under different conditions
May be used for personalized medicine
1: Introduction
Role of Centre for Bioinformatics in School of Biotechnology, BHU
MAI N
©1996-2007 All Rights Reserved. Online J ournal of Bioinformatics . You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be, reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.
OJBTM
Online Journal of Bioinformatics©
8 (1) : 75-83, 2007
In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different
Cultivars of Wheat, Rice and Oat
Yadav D1, Singh VK1, Singh NK2
1Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities G.B. pant University of Agriculture and Technology, Pantnagar (Uttarakhand) 2National Research Center on Plant Biotechnology Indian Agriculture Research Institute, New Delhi 110012
ABSTRACT
A total of 24 promoter sequences with assigned accession number EF393165 to EF393188 and representing major seed storage proteins of wheat namely High molecular weight glutenin subunit (HMW-GS), low molecular weight glutenin subunits (LMW-GS) alpha/beta gliadins, triticin along with rice glutelins and oat 12S globulins were cloned from indigenous cultivars of wheat, rice and oat and was subjected to in silico analysis using bioinformatic softwares for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing major group of seed storage promoters. The
presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE, MBS along with the common motifs present in seed storage promoters like Prolamin-box, TATA, CAAT provides a better option for multifarious uses. Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.
Seed Storage Protein Promoters
Accession Number
Cultivars Length(bp)
HMW Glutenin(Triticum aestivum)
EF396165EF396184EF396166EF396167EF396168EF396169EF396170EF396171EF396172EF396173
UP-262UP-262UP-262UP-262UP-262UP-301UP-301UP-301UP-301UP-301
402487397412385385393398392393
LMW Glutenin(Triticum aestivum)
EF396187 HD-2329 551
/ gliadin (Triticum aestivum)
EF396174EF396175EF396177EF396178EF396176EF396182
KalyansonaKalyansonaUP-262UP-262UP-262UP-301
520564591521563548
Triticin (Triticum aestivum)
EF396181EF396183EF396185EF396186
HD-2329HD-2329HD-2329Kalyansona
428370452343
12S Globulin ( Avena sativa)
EF396179 UPO-94 549
Glutelins ( Oryza sativa)
EF396180EF396188
PantDhan-12Pusa Basmati
562487
200 bp
172 bp
Motif-1Motif-2 Motif-3
CCC CZinc- finger
AAY28423 Piceaabies
EAY88711 Oryzasativaindica
ABI16029 Glycinemax
XP 001751505 Physcomitrella
CAC85949 Hordeumvulgare
EAY73401 Oryzasativaindica
CAO15000 Vitisvinifera
ABN08462 Medicagotruncatula
NP 001042660 Oryzasativajaponi
XP 001759349 Physcomitrellapat
ACF80167 Zeamays
EAZ41131 Oryzasativa
ACF06723 Paspalumscrobiculatum
ACC59765 Eleusinecoracana
CAN79859 Vitisvinifera
NP 001060673 Oryzasativajaponi
ACF06725 Echinochloafrumentace
EAZ05181 Oryzasativaindica
ACF06718 Sorghumbicolor
CAO64539 Vitisvinifera
ACF06722 Panicumantidotale
ABQ42348 Glycinemax
ACF06719 Hordeumvulgare
ACF06726 Triticumaestivum
ACF06720 Avenasativa
ACF81642 Zeamays
ACF06721 Panicummilliaceum
CAA04440 Hordeumvulgare
CAA09976 Triticumaestivum
ACF06724 Setariaitalica
ACC59766 Oryzasativa
89
58
63
63
68
55
50
48
0.00.20.40.60.81.0
MAI N
©1996-2007 All Rights Reserved. Online J ournal of Bioinformatics . You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be, reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.
OJBTM
Online Journal of Bioinformatics©
8 (1) : 75-83, 2007
In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different
Cultivars of Wheat, Rice and Oat
Yadav D1, Singh VK1, Singh NK2
1Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities G.B. pant University of Agriculture and Technology, Pantnagar (Uttarakhand) 2National Research Center on Plant Biotechnology Indian Agriculture Research Institute, New Delhi 110012
ABSTRACT
A total of 24 promoter sequences with assigned accession number EF393165 to EF393188 and representing major seed storage proteins of wheat namely High molecular weight glutenin subunit (HMW-GS), low molecular weight glutenin subunits (LMW-GS) alpha/beta gliadins, triticin along with rice glutelins and oat 12S globulins were cloned from indigenous cultivars of wheat, rice and oat and was subjected to in silico analysis using bioinformatic softwares for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing major group of seed storage promoters. The
presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE, MBS along with the common motifs present in seed storage promoters like Prolamin-box, TATA, CAAT provides a better option for multifarious uses. Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.
Seed Storage Protein Promoters
Accession Number
Cultivars Length
)bp(
HMW Glutenin)Triticum
aestivum(
EF396165EF396184EF396166EF396167EF396168EF396169EF396170EF396171EF396172EF396173
UP-262UP-262UP-262UP-262UP-262UP-301UP-301UP-301UP-301UP-301
402487397412385385393398392393
LMW Glutenin)Triticum
aestivum(
EF396187 HD-2329 551
/ gliadin (Triticum aestivum)
EF396174EF396175EF396177EF396178EF396176EF396182
KalyansonaKalyansona
UP-262UP-262UP-262UP-301
520564591521563548
Triticin (Triticum aestivum)
EF396181EF396183EF396185EF396186
HD-2329HD-2329HD-2329
Kalyansona
428370452343
12S Globulin ( Avena sativa)
EF396179 UPO-94 549
Glutelins ( Oryza sativa)
EF396180EF396188
PantDhan-12Pusa Basmati
562487
http://www.insilicogenomics.in/cry-bt-search.asp
CERCOSPORA LEAF SPOT DISEASE OF PIGEONPEA AND ITS MANAGEMENT