Introduction to Bioinformatics

V. K. Singh, Information Officer, Centre for Bioinformatics, Banaras Hindu University

What is Bioinformatics

“The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research”

www.niehs.nih.gov/dert/trc/glossary.htm

1: Introduction

What can bioinformatics offer biologists?

Computational biology: the in silico genome revolution at the turn of the century.

In the very beginning

• Life was classified as plants and animals.

• When bacteria were discovered, they were initially classified as plants.

• Ernst Haeckel (1866) placed all unicellular organisms in a kingdom called Protista, separate from Plantae and Animalia.

When electron microscopes were developed, it was found that Protista in fact include both cells with and without a nucleus. Also, fungi were found to differ from plants, since they are heterotrophs (they do not synthesize their own food).

Thus, life was classified into 5 kingdoms:

LIFE: Fungi, Plants, Animals, Protists, Procaryotes

Later, plants, animals, protists and fungi were collectively called the Eucarya domain, and the procaryotes were moved from a kingdom to a domain of their own, Bacteria.

Domains: Eucarya, Bacteria

Kingdoms (within Eucarya): Fungi, Plants, Animals, Protists

Even later, a new domain was discovered…

Revolutionizing the Classification of Life

rRNA was sequenced from a great number of organisms to study phylogeny.

The rRNA phylogenetic tree

From sequence analysis alone, it was thus established that life is divided into 3 domains: Bacteria, Archaea, Eucarya.

Gregor Mendel: laws of inheritance, “gene” (1866)

Watson and Crick: discovery of the DNA double-helix structure (1953)

Genome Project (2003)

Sequencing of Genomes

How can we sequence a genome?

A single sequencing run usually reads only ~700 bp, so a genome has to be sequenced piece by piece: by walking or by shotgun sequencing.

Genomic Sequencing – Walking

1. Design a primer
2. Sequence
3. Design a new primer
4. Sequence
5. …

One has to design new primers every time. To do so, one has to wait for the sequencing results.

Genomic Sequencing – shotgun sequencing

1. Break DNA into small pieces
2. Sequence each piece
3. Assemble

Sequenced pieces:

ACCCCGAGGAGACGA
GAGGAGACGAACACCCGTATACAGTCGACG
GTATACAGTCGACGTTTATATATA

Assembled sequence:

ACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATA
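The "assemble" step can be sketched as a greedy merge of the pieces with the longest suffix/prefix overlap. This is only a toy illustration, not the method used by real assemblers; the function names and the `min_len` threshold are assumptions made for this example, and pieces fully contained in other pieces are not handled.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len (an assumed threshold) are ignored.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of pieces with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# The three pieces shown on the slide.
reads = [
    "GAGGAGACGAACACCCGTATACAGTCGACG",
    "GTATACAGTCGACGTTTATATATA",
    "ACCCCGAGGAGACGA",
]
print(greedy_assemble(reads))
```

With the three pieces from the slide, the sketch rebuilds the 45-letter sequence shown above.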

Shotgun sequencing – why isn’t it a trivial task?

1. By chance, some parts are not sequenced even once!

GAGGAGACGAACACCCGTATACAGTCGACG

ACCCCGAGGAGACGA ? GTATACAGTCGACGTTTATATATA

GTATACAGTCGACGTTTATATATA

ACCCCGAGGAGACGA
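How often does this happen? A rough estimate, assuming pieces start at uniformly random positions (a Poisson model that is not part of the slides): with average coverage c = (number of pieces × piece length) / genome length, the expected fraction of bases never sequenced is about e^(-c). The function name and the genome size below are hypothetical.

```python
import math


def expected_uncovered_fraction(num_reads, read_length, genome_length):
    """Expected fraction of the genome covered by no read (Poisson model)."""
    coverage = num_reads * read_length / genome_length
    return math.exp(-coverage)


# Hypothetical 1 Mb genome, ~700 bp per run, at increasing numbers of runs.
for num_reads in (1_500, 7_500, 15_000):
    frac = expected_uncovered_fraction(num_reads, 700, 1_000_000)
    print(num_reads, "runs:", f"{frac:.5f}", "of the genome expected to be missing")
```

Even at high coverage the expected missing fraction is small but not zero, which is why gaps remain.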

Shotgun sequencing – why isn’t it a trivial task?

2. Some pieces do not align because of sequencing errors

GAGGTGAGGAACACCCGTATACAGTCGACG

ACCCCGAGG?GA?GAACACCCGTATACAGTCGACGTTTATATATA

ACCCCGAGGAGACGA

Shotgun sequencing – why isn’t it a trivial task?

3. Repetitive sequences – satellite DNA.

GGGGGGGGGGGGGGGGGGGGGGGGGGGG

ACCCCGGGGGGGGGGGGG????GGGGGGGGGGGGGA

GGGGGGGGGGGGGGGGGGGGGGA

ACCCCGGGGG

A contig: a section of the genome that could be reliably assembled.

BIOINFORMATICS DATABASES

What’s in a database?

• Sequences – genes, proteins, etc.
• Full genomes
• Expression data
• Structures
• Annotation – information about genes/proteins:
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - phenotypes, diseases
• Publications

NCBI and Entrez

• One of the largest and most comprehensive collections of databases, belonging to the NIH (National Institutes of Health, the primary federal agency for conducting and supporting medical research in the USA)

• Entrez is the search engine of NCBI

• Search for: genes, proteins, genomes, structures, diseases, publications, and more

http://www.ncbi.nlm.nih.gov

PubMed: NCBI’s database of biomedical articles

Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

Use fields!

Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]

For the full list of field tags: go to help -> Search Field Descriptions and Tags
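A fielded query like the one above can also be built programmatically. The sketch below only assembles the search term and a URL for NCBI's E-utilities `esearch` endpoint; it does not contact the server. The helper names `build_term` and `esearch_url` are made up for this example.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(fields):
    """Join (value, tag) pairs into 'value[TAG] AND value[TAG] ...'."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in fields)


def esearch_url(term, db="pubmed"):
    """Build (but do not send) an esearch request URL for the given term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


term = build_term([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(term)
print(esearch_url(term))
```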

Example

• Retrieve all publications in which the first author is: Davidovich C and the last author is: Yonath A

Using limits

Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A from the last 5 years

Searching NCBI for the protein human CD4

Search demonstration

Using field descriptions, qualifiers, and boolean operators

• Cd4[GENE] AND human[ORGN], or equivalently Cd4[gene name] AND human[organism]

• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

– Boolean operators: AND, OR, NOT

Note: do not use the field Protein name [PROT], only GENE!

This time we search directly in the protein database

RefSeq

• Subcollection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

An explanation of GenBank records

Fasta format

>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

The header line starts with “>” and holds the ID/accession followed by a description; the sequence follows on the next line(s).

Save accession numbers for future use (makes searching quicker):
RefSeq accession number: NP_000607.1
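A Fasta file is simple enough to read with a few lines of code. The sketch below is a minimal parser, assuming headers in the “gi|…|ref|accession| description” style shown on this slide; `parse_fasta` and `split_header` are names invented for this example, and only the first 30 residues of the CD4 record are used.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'gi|..|ref|ACCESSION| description' into (accession, description)."""
    ids, _, description = header.rpartition("|")
    accession = ids.split("|")[-1]
    return accession, description.strip()


# Header from the slide; sequence shortened to its first 30 residues.
example = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQ
LALLPAATQGKKVVL
"""
records = parse_fasta(example)
for header, sequence in records:
    print(split_header(header), len(sequence))
```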

Downloading

Homology Search Using Sequence Alignment

MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

Before we begin…

What is sequence alignment?

Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in them.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
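The line of “|” marks between two aligned sequences can be generated mechanically. A small sketch (function names are illustrative only) that marks identical positions and counts them for the two aligned sequences on this slide:

```python
def match_line(a, b):
    """Return a string with '|' where two aligned sequences are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def identity(a, b):
    """Return (number of identical positions, alignment length)."""
    return match_line(a, b).count("|"), len(a)


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(top)
print(match_line(top, bottom))
print(bottom)
print("identical positions: %d of %d" % identity(top, bottom))
```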

Why sequence alignment?

Predict characteristics of a protein – use the structure or function information available in databases for known proteins with similar sequences in order to predict the structure or function of an unknown protein

Assumption: similar sequences produce similar proteins

Local vs. Global

• Global alignment – finds the best alignment across the whole of the two sequences.

• Local alignment – finds regions of high similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: concentrates on regions of high similarity

ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
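The difference can be seen in the scores computed by the two classic dynamic-programming methods, Needleman-Wunsch (global) and Smith-Waterman (local). These algorithms are not named on the slides; the sketch below is a minimal score-only version with an assumed scoring of match +1, mismatch -2, gap -1.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best score of an alignment covering both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best score of an alignment between any two substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


a, b = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", global_score(a, b))
print("local: ", local_score(a, b))
```

The global score is dragged down by the differing middle region, while the local score reflects only the best-matching stretch.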

Sequence evolution

In the course of evolution, the sequences changed from the ancestral sequence by random mutations.

Three types of changes:

1. Insertion – an insertion of a letter or several letters into the sequence.
   AAGA → AAGTA

2. Deletion – a deletion of a letter (or more) from the sequence.
   AAGA → AGA

3. Substitution – a replacement of one (or more) sequence letters by another.
   AAGA → AACA

Insertion + Deletion = Indel
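The three kinds of change are easy to reproduce on a string. A small sketch using the AAGA example from the slide (function names and 0-based positions are choices made for this illustration):

```python
def insert(seq, pos, letters):
    """Insert one or more letters before position pos (0-based)."""
    return seq[:pos] + letters + seq[pos:]


def delete(seq, pos, length=1):
    """Delete `length` letters starting at position pos (0-based)."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq, pos, letter):
    """Replace the letter at position pos (0-based) by another letter."""
    return seq[:pos] + letter + seq[pos + 1:]


ancestor = "AAGA"
print(insert(ancestor, 3, "T"))      # insertion
print(delete(ancestor, 0))           # deletion
print(substitute(ancestor, 2, "C"))  # substitution
```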

Sequence alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:

2 mismatches
4 indels (gaps)
10 perfect matches

Choosing an alignment:

• Many different alignments are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?


Scoring an alignment: example of a naïve scoring system

• Match: +1
• Mismatch: -2
• Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment
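The naïve scoring system (match +1, mismatch -2, indel -1) can be checked with a short sketch. The function name is illustrative; gaps are assumed to be written as “-”.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Count matches, mismatches and indels column by column and score them."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["indel"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    score = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + indel * counts["indel"])
    return score, counts


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)
print(second)
```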

Scoring system:

• Different scoring systems can produce different optimal alignments

• Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, or physico-chemical properties based

– Some mismatches are more plausible than others
  • Transition vs. transversion
  • Lys→Arg ≠ Lys→Cys

– Gap extension vs. gap opening
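Both ideas can be illustrated in a few lines. A transition swaps two purines (A, G) or two pyrimidines (C, T); anything else is a transversion. For gaps, one common convention charges a larger penalty for opening a gap than for extending it. The penalty values below are arbitrary assumptions for illustration, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a nucleotide pair as match, transition or transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Penalty for one gap of the given length: open once, then extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


print(substitution_type("A", "G"))  # purine to purine
print(substitution_type("A", "C"))  # purine to pyrimidine
print(affine_gap_penalty(3), "for one gap of length 3")
print(3 * affine_gap_penalty(1), "for three separate gaps of length 1")
```

Under this convention one long gap is cheaper than several short ones, which matches the intuition that a single indel event can span several letters.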

Substitution Matrices

• Nucleic acids:
  – Transition-transversion

• Amino acids:
  – Evolution (empirical data) based: (PAM, BLOSUM)
  – Physico-chemical properties based (Grantham, McLachlan)

Web server for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine

• Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

blastn – nucleotide vs. nucleotide

blastp – protein vs. protein

Bl2seq – query

Bl2seq results

Legend: Match, Dissimilarity, Gaps, Similarity, Low complexity

Bl2seq results:

• Bit score – a score for the alignment according to the number of similarities, identities, etc.

• Expect value (E-value) – the number of alignments with the same or a better score one can “expect” to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real
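The two numbers are linked: in the standard BLAST statistics, the expected number of chance hits is roughly E = m × n × 2^(-S), where S is the bit score, m the query length and n the database length. The sketch below uses this simplified form (real BLAST uses corrected "effective" lengths); the function name and the example sizes are assumptions.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value: m * n * 2 ** (-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: 100-letter query against a 1,000,000-letter database.
for bits in (20, 40, 60):
    print(bits, "bits ->", expected_hits(bits, 100, 10**6))
```

Every additional bit halves the E-value, and a larger database raises it for the same score.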

BLAST – programs

Query: DNA Protein

Database: DNA Protein
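Which program fits which combination of query and database? The mapping below uses the standard BLAST program names (blastn, blastp, blastx, tblastn, tblastx), which are not spelled out on this slide; the dictionary and function are only an illustrative lookup.

```python
# (query type, database type) -> BLAST programs that handle the combination.
BLAST_PROGRAMS = {
    ("DNA", "DNA"): ["blastn", "tblastx"],   # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("DNA", "protein"): ["blastx"],          # query is translated
    ("protein", "DNA"): ["tblastn"],         # database is translated
}


def choose_programs(query_type, database_type):
    """Return the BLAST programs for a query/database combination."""
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_programs("DNA", "protein"))
print(choose_programs("protein", "protein"))
```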

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blastp – acquiring sequences

blastp – acquiring sequences (cont’)

Multiple Sequence Alignment (MSA) and Phylogeny

One way to obtain a multiple-sequence Fasta file

Input: multiple sequence Fasta file

>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS

>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS

>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL. . .

Step 1: Load the sequences

Sequences and conservation view

Step 2: Perform alignment

Sequences and conservation view

Step 3: Create tree

Step 4: NJPlot

How robust is our tree?

• We need some statistical way to estimate the confidence in the tree topology

• But we don’t know anything about the tree topology distribution or parameters

• The only data source we have is our data (MSA)

• So, we must rely on our own resources: “pull up by your own bootstraps”

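Bootstrap

1. Resample K positions (with replacement) n times

Original alignment, positions 1 2 3 4 5 … K:

1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
N : ACCTA…T

Resampled data set, positions 1 1 2 4 4 … K:

1 : AATTT…T
2 : AATTT…G
3 : AACTT…T
N : AACTT…T

Resampled data set, positions 4 7 7 8 9 … K:

1 : TTTAT…T
2 : TAACC…G
3 : TAACC…T
N : TGGGA…T

Resampled data set, positions 1 5 5 7 8 … K:

1 : AGGTA…T
2 : AGGAC…G
3 : AAAAC…A
N : AAAGG…C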
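The resampling step can be sketched as drawing K column indices with replacement and rebuilding every row from those columns. The toy alignment below keeps only the letters visible on the slide (the elided middle is dropped), and the function names are invented for this example.

```python
import random


def bootstrap_columns(alignment, rng):
    """Draw K columns with replacement; return (resampled rows, chosen columns)."""
    k = len(alignment[0])
    cols = [rng.randrange(k) for _ in range(k)]
    rows = ["".join(seq[c] for c in cols) for seq in alignment]
    return rows, cols


def bootstrap_replicates(alignment, n, seed=0):
    """Return n resampled data sets, each the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_columns(alignment, rng)[0] for _ in range(n)]


# Toy alignment: 4 sequences, 6 positions.
alignment = ["ATCTGA", "ATCTGC", "ACTTAC", "ACCTAT"]
for replicate in bootstrap_replicates(alignment, 3):
    print(replicate)
```

The same columns are picked for every sequence, so each resampled data set is still a valid alignment.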

Bootstrap

2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree

[Figure: one tree over Sp1, Sp2, Sp3 and Sp4 reconstructed from each of the three resampled data sets]

Bootstrap

3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis

[Figure: three bootstrap trees and the original tree over Sp1, Sp2, Sp3 and Sp4; the nodes of the original tree carry support values of 67% and 100%]
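Counting support can be sketched by describing each tree as the set of groups (clades) it contains and asking in how many bootstrap trees each group of the original tree appears. The trees below are hypothetical and chosen only to reproduce values like 67% and 100%; they are not read from the figure.

```python
def bootstrap_support(original_clades, replicate_trees):
    """Percentage of replicate trees that contain each clade of the original tree."""
    n = len(replicate_trees)
    return {
        clade: 100.0 * sum(clade in tree for tree in replicate_trees) / n
        for clade in original_clades
    }


def clade(*species):
    """A clade is represented as the (unordered) set of species it groups."""
    return frozenset(species)


# Hypothetical original tree and three hypothetical bootstrap trees.
original = [clade("Sp1", "Sp2"), clade("Sp1", "Sp2", "Sp3")]
replicates = [
    {clade("Sp1", "Sp2"), clade("Sp1", "Sp2", "Sp3")},
    {clade("Sp1", "Sp2"), clade("Sp1", "Sp2", "Sp3")},
    {clade("Sp1", "Sp2"), clade("Sp3", "Sp4")},
]
support = bootstrap_support(original, replicates)
for group, value in support.items():
    print(sorted(group), f"{value:.0f}%")
```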

Step 3.5 - Bootstrap

Bootstrap values on NJPlot

Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb

You might have to reopen the tree…

Protein Information Resources

• Swissprot
• PDB

Swissprot

• A protein sequence database which strives to provide a high level of annotation regarding:
  * the function of a protein
  * domain structure
  * post-translational modifications
  * variants

• One entry for each protein

http://www.expasy.ch/sprot

PDB: Protein Data Bank

• Main database of 3D structures of macromolecules

• Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)

• Is highly redundant

http://www.rcsb.org

Human CD4 in complex with HIV gp120

gp120

CD4

PDB ID 1G9M

What do bioinformaticians study?

• Bioinformatics today is part of almost all molecular biology research.

• Just a few examples…

Example 1

• Compare proteins with similar sequences (for instance, kinases) and understand what the similarities and differences mean

Example 2

• Look at the genome and predict where genes are (promoters; transcription factor binding sites; introns; exons)

Example 3

• Predict the 3-dimensional structure of a protein from its primary sequence

Ab initio prediction – extremely difficult!

Example 4

• Correlate gene expression with disease

A gene chip – quantifying gene expression in different tissues under different conditions

May be used for personalized medicine

Role of Centre for Bioinformatics in School of Biotechnology, BHU

©1996-2007 All Rights Reserved. Online Journal of Bioinformatics. You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking: To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.

OJB™

Online Journal of Bioinformatics©

8 (1) : 75-83, 2007

In silico Cis-regulatory Elements Analysis of Seed Storage Protein Promoters Cloned from Different Cultivars of Wheat, Rice and Oat

Yadav D¹, Singh VK¹, Singh NK²

¹ Department of Molecular Biology and Genetic Engineering, College of Basic Sciences and Humanities, G.B. Pant University of Agriculture and Technology, Pantnagar (Uttarakhand)

² National Research Center on Plant Biotechnology, Indian Agriculture Research Institute, New Delhi 110012

ABSTRACT

A total of 24 promoter sequences, with assigned accession numbers EF393165 to EF393188 and representing major seed storage proteins of wheat, namely high molecular weight glutenin subunits (HMW-GS), low molecular weight glutenin subunits (LMW-GS), alpha/beta gliadins and triticin, along with rice glutelins and oat 12S globulins, were cloned from indigenous cultivars of wheat, rice and oat and were subjected to in silico analysis using bioinformatics software for the presence of different cis-regulatory motifs. The phylogeny studies based on the multiple sequence alignment of these promoters revealed four distinct clusters showing the major groups of seed storage promoters. The presence of additional motifs like RY repeats, ABRE, AC-11, CAAT box, LTR, UTR, CCGTCC box, G box, GARE and MBS, along with the common motifs present in seed storage promoters like Prolamin-box, TATA and CAAT, provides a better option for multifarious uses.

Keywords: Seed storage protein promoters, Cis-regulatory Elements, In silico.

Seed Storage Protein Promoters     Accession Number   Cultivar       Length (bp)

HMW Glutenin (Triticum aestivum)   EF396165           UP-262         402
                                   EF396184           UP-262         487
                                   EF396166           UP-262         397
                                   EF396167           UP-262         412
                                   EF396168           UP-262         385
                                   EF396169           UP-301         385
                                   EF396170           UP-301         393
                                   EF396171           UP-301         398
                                   EF396172           UP-301         392
                                   EF396173           UP-301         393

LMW Glutenin (Triticum aestivum)   EF396187           HD-2329        551

α/β gliadin (Triticum aestivum)    EF396174           Kalyansona     520
                                   EF396175           Kalyansona     564
                                   EF396177           UP-262         591
                                   EF396178           UP-262         521
                                   EF396176           UP-262         563
                                   EF396182           UP-301         548

Triticin (Triticum aestivum)       EF396181           HD-2329        428
                                   EF396183           HD-2329        370
                                   EF396185           HD-2329        452
                                   EF396186           Kalyansona     343

12S Globulin (Avena sativa)        EF396179           UPO-94         549

Glutelins (Oryza sativa)           EF396180           PantDhan-12    562
                                   EF396188           Pusa Basmati   487

[Figure: 200 bp and 172 bp fragments; Motif-1, Motif-2, Motif-3; CCCC zinc-finger]

[Figure: phylogenetic tree of 31 sequences labelled by accession number and species (for example ACF06726 Triticum aestivum, ACF06720 Avena sativa, ACC59766 Oryza sativa), with bootstrap values between 48 and 89 and a scale from 0.0 to 1.0]


http://www.insilicogenomics.in/cry-bt-search.asp
