dbase.ppt

  • 8/11/2019 dbase.ppt

    1/36

    Introduction to bioinformatics

    Sylvia B. Nagl

    What is bioinformatics?

    an emerging interdisciplinary research area

    deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...

    The Core of Bioinformatics to date

    Relationships between sequence, 3D structure, and protein functions

    Properties and evolution of genes, genomes, proteins, metabolic pathways in cells

    Use of this knowledge for prediction, modelling, and design

    TDQAAFDTNIVTLTRFVM

    EQGRKARGTGEMTQLLNS

    LCTAVKAISTAVRKAGIA

    HLYGIAGSTNVTGDQVKK

    LDVLSNDLVINVLKSSFA

    TCVLVTEEDKNAIIVEPE

    KRGKYVVCFDPLDGSSNI

    DCLVSIGTIFGIYRKNST

    DEPSEKDALQPGRNLVAA

    GYALYGSATMLV

    The holy grail of bioinformatics

    GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGATCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAGTTAACCTAA...

    > 500,000 genes sequenced to date

    Expected number of unique protein structures: ~700-1,000

    Basic concepts

    conceptual foundations of bioinformatics: evolution, protein folding, protein function

    bioinformatics builds mathematical models of these processes to infer relationships between components of complex biological systems

    Information processing in cells

    coding regions

    regulatory sites

    nucleic acids

    transcripts

    proteins

    One-to-many mappings!

    Context-dependence!
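
    The step from nucleic acids to proteins, and the way one sequence can map to several products depending on context, can be illustrated with a minimal Python sketch. It translates a DNA string in its three forward reading frames using the standard genetic code. The function names are illustrative assumptions, the input is the first 30 bases of the DNA string shown on the "holy grail" slide, and regulation, splicing, and the reverse strand are all left out.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(dna: str) -> list[str]:
    """Translate the same DNA starting at offsets 0, 1, and 2."""
    return [translate(dna[offset:]) for offset in range(3)]


if __name__ == "__main__":
    dna = "GCTCCTCACTGTCTGTGTTTATTCTTTTAG"  # start of the slide's DNA string
    for offset, protein in enumerate(forward_frames(dna)):
        print(f"frame {offset}: {protein}")
```

    The three frames give three different amino acid strings from one piece of DNA, a small instance of the one-to-many mapping the slide points to.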

    Global cell state

    Genome activation patterns: transcriptomics

    Protein population: proteomics

    Organisation: tissue, cells, molecular complexes (imaging, EM, X-ray, NMR)

    Global approaches: Toward a new Systems Biology

    How does the spatial and temporal organisation of living matter give rise to biological processes?

    Genome

    Living cell

    Virtual cell

    Perturbation Dynamic response

    Biological knowledge (computerised)

    Sequence information

    Structural information

    Basic principles

    Practical applications

    Global approaches: Toward a new Systems Biology

    Bioinformatics

    Mathematical modelling

    Simulation

    Bioinformatics in context

    Genomics

    Molecular evolution

    Biophysics

    Molecular biology

    Ethical, legal, and social implications

    Bioinformatics

    Mathematics/computer science

    Current challenges to users

    Potential hurdles: methods are in flux and not fully developed; scattered and heterogeneous resources

    Remedies: Web resources, navigation guides, integration of tools and databanks

    http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html

    Sequence homology search of the genome of Plasmodium falciparum

    Target identification for antimalarial drugs
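
    The slides do not say which search method was used. As a stand-in, the core idea of a sequence homology search can be sketched with Smith-Waterman local alignment scoring, which finds the best-matching region shared by two sequences. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions for illustration; production tools such as BLAST use substitution matrices and heuristics instead.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current_row[j] = max(
                0,                                   # start a new local alignment
                previous_row[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous_row[j] + gap,               # gap in b
                current_row[j - 1] + gap,            # gap in a
            )
            best = max(best, current_row[j])
        previous_row = current_row
    return best


if __name__ == "__main__":
    query = "GGACGTGG"
    # Hypothetical database entries, ranked by similarity to the query.
    database = {"entry_a": "TTACGTTT", "entry_b": "CCCCCCCC"}
    ranked = sorted(database, reverse=True,
                    key=lambda name: smith_waterman_score(query, database[name]))
    print(ranked)
```

    Ranking database entries by such a score is the basic step behind looking for homologs of a known enzyme in a newly sequenced genome.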

    The search for new antimalarial drugs

    Malaria is one of the leading causes of morbidity and mortality in the tropics.

    300 to 500 million estimated clinical cases and 1.5 million to 2.7 million deaths per year.

    Nearly all fatal cases are caused by Plasmodium falciparum.

    The parasite's resistance to conventional antimalarial drugs such as chloroquine is growing at an alarming rate.

    P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga.

    Self-replicating, maternally inherited (35 kb, circular DNA). Comparative genome analysis: search for orthologs.

    Apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways. Potential target for antimalarial drugs:

    DOXP reductoisomerase

    Jomaa et al. (1999)

    Jomaa et al. (1999) Science 285: 1573-1576

    Biological databases

    The challenge

    In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature!

    (Boguski, 1999)

    Data types

    primary data: sequence, DNA (AATGCGTATAGGC) or amino acid (DMPVERILEALAVE); held in a primary database

    secondary data: secondary protein structure, e.g., alpha-helices, beta-strands; a secondary db holds motifs: regular expressions, blocks, profiles, fingerprints

    tertiary data: tertiary protein structure (domains, folding units); a tertiary db holds atomic co-ordinates
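
    Of the motif representations listed for secondary databases, the regular expression is the simplest to show in code. The sketch below assumes PROSITE-style pattern syntax and uses the N-glycosylation site pattern N-{P}-[ST]-{P} as its example. The helper names are hypothetical, anchors such as `<` and `>` are not handled, and the test sequence is a fragment of the protein sequence shown earlier in these slides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                     # x: any residue
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat counts
        regex_parts.append(element)
    return "".join(regex_parts)


def find_motif(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every match, overlaps included."""
    # The lookahead lets matches overlap instead of consuming the sequence.
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    fragment = "DCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAA"
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", fragment))
```

    A regular expression either matches or does not, which is why blocks, profiles, and fingerprints exist as more tolerant alternatives.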

    Primary biological databases

    Nucleic acid: EMBL, GenBank, DDBJ (DNA Data Bank of Japan)

    Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D

    International nucleotide data banks

    EMBL: Europe (EMBL, EBI)

    GenBank: USA (NLM, NCBI)

    DDBJ: Japan (NIG, CIB)

    International Advisory Meeting

    Collaborative Meeting

    TrEMBL NRDB

    GenBank file format
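
    The example record from the original slide did not survive in this transcript. As a rough illustration of the flat-file layout, here is a minimal Python sketch that reads a record: keywords start in the first column, the sequence follows the ORIGIN line, and `//` ends the record. The record below is hypothetical (invented locus name and accession, with bases borrowed from the DNA string shown earlier), and the parser folds all indented lines into the preceding keyword rather than handling sub-keywords or the feature table.

```python
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA
DEFINITION  Hypothetical example record,
            not a real database entry.
ACCESSION   X00000
ORIGIN
        1 gctcctcact gtctgtgttt attc
//
"""


def parse_genbank(text: str) -> dict[str, str]:
    """Collect top-level keyword fields and the sequence of one record."""
    record: dict[str, str] = {}
    sequence_parts: list[str] = []
    current_key = None
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line[:1].strip():           # keyword in the first column
            current_key, _, value = line.partition(" ")
            record[current_key] = value.strip()
        elif current_key and line.strip():
            # Indented line: treated here as a continuation of the last keyword.
            record[current_key] += " " + line.strip()
    record["SEQUENCE"] = "".join(sequence_parts).upper()
    return record


if __name__ == "__main__":
    for key, value in parse_genbank(EXAMPLE_RECORD).items():
        print(f"{key}: {value}")
```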

    Swiss-Prot

    SWISS-PROT file format
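
    The annotated example entries from the original slides are not preserved in this transcript. The general layout can still be sketched: each line starts with a two-letter line code (such as ID, AC, DE, SQ), sequence lines are indented, and `//` closes the entry. The entry below is hypothetical (invented name and accession, with the amino acid example from the "Data types" slide), and the parser is a minimal Python sketch, not a complete reader of the format.

```python
from collections import defaultdict

EXAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY   STANDARD;   PRT;   14 AA.
AC   P00000;
DE   Hypothetical example entry,
DE   not a real database record.
SQ   SEQUENCE   14 AA;
     DMPVERILEA LAVE
//
"""


def parse_swissprot(text: str) -> dict[str, str]:
    """Group lines by their two-letter line code and join the sequence lines."""
    fields: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            break
        code, content = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(content)     # annotation line
        else:
            fields["sequence"].append(content.replace(" ", ""))  # sequence line
    entry = {code: " ".join(parts)
             for code, parts in fields.items() if code != "sequence"}
    entry["sequence"] = "".join(fields["sequence"])
    return entry


if __name__ == "__main__":
    for code, value in parse_swissprot(EXAMPLE_ENTRY).items():
        print(f"{code}: {value}")
```

    Because every line carries its own code, repeated codes (the two DE lines here) can simply be joined, which keeps such a reader short.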

    Other primary protein databases

    TrEMBL (translated EMBL) in SWISS-PROT format

    rapid access to sequence data from genome projects

    computer-annotated supplement to SWISS-PROT

    translations of all coding sequences (CDS) in EMBL

    SP-TrEMBL

    REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments, synthetic and patented sequences

    The Protein Information Resource (PIR)

    integrated system of protein sequence databases and derived related databases, e.g., alignment databases

    rapid searching, comparison, and pattern matching of protein sequences

    retrieval of descriptive, bibliographic, feature, and concurrent cross-reference information

    aims to be comprehensive and consistently annotated

    PIR: related databases

    NRL-3D Sequence-Structure Database

    produced by PIR from sequence and annotation information extracted from three-dimensional structures in the Protein Data Bank (PDB)

    allows keyword and similarity searches

    PATCHX integrated with PIR

    a non-redundant database of protein sequences produced by MIPS, the European branch of PIR-International

    The PIR Protein Sequence Database and PATCHX together provide the most complete collection of protein sequence data currently available in the public domain.

    Composite protein sequence dbs

    NRDB: PIR, SP, PDB, GenPept

    OWL: PIR, SP, GenBank, NRL-3D

    MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP

    SP+TrEMBL: TrEMBL, SP
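
    A composite database merges several source databases and removes redundant entries. The Python sketch below shows one way this could work, assuming that "redundant" means an identical sequence string and that sources listed first take priority. Real composite databases apply their own, more detailed rules, and the entry identifiers used here are hypothetical.

```python
def build_composite(
    sources: list[tuple[str, dict[str, str]]]
) -> dict[str, tuple[str, str]]:
    """Merge sources in priority order, keeping the first copy of each sequence.

    Returns a mapping: sequence -> (source name, entry identifier).
    """
    composite: dict[str, tuple[str, str]] = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()          # compare sequences case-insensitively
            if key not in composite:        # an earlier source already wins
                composite[key] = (source_name, entry_id)
    return composite


if __name__ == "__main__":
    # Hypothetical entries; the sequences are fragments shown earlier in the slides.
    sources = [
        ("SP", {"sp_1": "DMPVERILEALAVE"}),
        ("PIR", {"pir_1": "dmpverilealave", "pir_2": "TDQAAFDTNIVTLTRFVM"}),
    ]
    for sequence, origin in build_composite(sources).items():
        print(origin, sequence)
```

    Exact-match removal keeps near-identical variants and fragments as separate entries, so a composite built this way is smaller than its sources combined but not free of all redundancy.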

    OWL composite database

    OWL only released every 6-8 weeks

    By accession number

    By database code

    By text

    By sequence

    By title

    By author

    By query language

    By regular expression

    Direct OWL access:

    OWL Blast server

    Two other useful sites

    INFOBIOGEN: The Public Catalog of Databases

    http://www.infobiogen.fr/services/dbcat/

    KEGG: Kyoto Encyclopedia of Genes and Genomes

    http://www.genome.ad.jp/kegg/

    Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.

    Sequence Retrieval System (SRS)

    Database browser that allows users to retrieve, link, and access entries from all interconnected resources.

    Users can formulate queries across a range of different database types.

    Guide to Protein Databases:

    http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html

    http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html

    With thanks to Dr Roman Laskowski.
