Insilico Gene Analysis

  • 8/8/2019 Insilico Gene Analysis

    iabt

    In silico Gene Analysis


    Outline

    Introduction

    Alignment

    ORF searching

    3D protein modeling

    Case study


    INTRODUCTION

    What is a gene?

    A length of DNA that codes for a particular protein or, in certain
    cases, a functional or structural RNA molecule

    What are the essential components of a gene?

    Initiation codon

    Introns and exons (in eukaryotes)

    Stop codon

    Regulatory sequences


    INTRODUCTION (continued)

    Essential features of a gene that are considered in in silico gene analysis

    All proteins are built from 20 amino acids (each with a one-letter code)

    The stop codons are fixed: TAA, TAG and TGA

    Intron boundaries follow the GT-AG rule (GU-AG in the RNA transcript)

    Codon usage differs from organism to organism

    All nucleotide sequences consist essentially of A, T, G and C

    The initiation codon is usually ATG


    FILE FORMATS

    FASTA format

    >XM_414949 | Gallus gallus | alpha 2 globin
    MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF

    IG (IntelliGenetics) format

    ; comment
    ; comment
    XM_414949
    MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF1

    GDE format

    %XM_414949 | Gallus gallus | alpha 2 globin
    MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF

    NBRF/PIR format

    >P1; XM_414949 | Gallus gallus | alpha 2 globin
    MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
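As an illustration of how the FASTA layout can be read programmatically, here is a minimal sketch in Python. The function name `parse_fasta` is hypothetical (it is not taken from any tool named in these slides), and the sequence is wrapped over two lines only to show that wrapped records are joined.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>XM_414949 | Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEY
GAETLERMFTTYPPTKTYF
"""
records = parse_fasta(example)
```

Each record is returned as a header string (without the leading `>`) and one unbroken sequence string.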


    ALIGNMENTS

    An alignment is the result of comparing two or more gene or protein
    sequences in order to determine their degree of base or amino acid similarity

    Types of alignment

    Pairwise alignment: local alignment or global alignment

    Multiple alignment


    >NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens

    REFERENCE SEQUENCE

    atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacgtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagag

    gttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatcactaa

    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
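The relationship between the nucleotide reference sequence and the protein sequence can be checked with a short translation routine. The sketch below assumes the standard genetic code and a coding sequence that starts in frame; the name `translate` and the compact table construction are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # TAA, TAG or TGA
            break
        protein.append(residue)
    return "".join(protein)


# First eight codons of the reference sequence above.
start_of_protein = translate("atggtgcatctgactcctgaggag")
```

Translating the first 24 bases of the reference sequence gives the first eight residues of the protein shown above.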


    PAIRWISE ALIGNMENT

    Two sequences are compared at a time

    Sequences may be nucleotide-nucleotide or amino acid-amino acid

    The alignment may be gapped or ungapped

    Examples: BLAST and FASTA

    Two algorithms:

    Smith-Waterman algorithm (local alignment)

    Needleman-Wunsch algorithm (global alignment)
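The difference between the two algorithms can be seen in a small dynamic-programming sketch. It computes scores only (no traceback) and assumes a simple scheme of +1 for a match, -1 for a mismatch and -1 per gap position; these values and the function names are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural differences are the zero floor and the zero-initialised borders in the local version, which let an alignment start and end anywhere in either sequence.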


    BLAST (Basic Local Alignment Search Tool)

    Pairwise local alignment

    Developed by Stephen Altschul, Warren Gish and David Lipman

    BLAST searches for short matches (words) of a fixed length W between the
    query and the sequences in the database

    Stages in the search

    If the query and a database sequence share a common word, BLAST extends an
    ungapped alignment on either side of that word

    BLAST then performs a gapped alignment between the query sequence and the
    database sequence
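The word-matching and ungapped-extension stages can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches only, a +1/-1 scoring scheme, and a hypothetical drop-off threshold `x_drop` that stops the extension once the running score falls too far below the best score seen.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs that share a word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, columns kept)."""
    gain = best = kept = columns = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        columns += 1
        if gain > best:
            best, kept = gain, columns
        elif best - gain > x_drop:
            break  # score dropped too far below the best seen
        qi += step
        sj += step
    return best, kept


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions; return (score, q, s)."""
    right, r_cols = _extend(query, subject, i + w, j + w, 1,
                            match, mismatch, x_drop)
    left, l_cols = _extend(query, subject, i - 1, j - 1, -1,
                           match, mismatch, x_drop)
    score = w * match + left + right
    return (score,
            query[i - l_cols:i + w + r_cols],
            subject[j - l_cols:j + w + r_cols])
```

For example, `extend_seed("AAGATTACAGG", "CCGATTACATT", 2, 2)` grows the shared word GAT into the ungapped segment pair GATTACA / GATTACA. The gapped stage would then start from such segments.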


    BLAST (continued)

    It treats the whole database as one sequence and aligns the query
    sequence against it

    High-scoring segment pairs (HSPs)


    BLAST (continued)

    Low complexity region


    FASTA

    Pairwise local alignment

    Developed by David J. Lipman and William R. Pearson in 1985

    It looks for identically matching words of a given length, called ktup

    It identifies a single high-scoring region

    It matches each individual database sequence against the query sequence


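    FASTA (continued)

    It aligns each individual database sequence with the query sequence

    The E value is calculated differently from BLAST

    E = Np, where N is the number of database sequences searched and p is the
    probability of obtaining the score by chance in a single comparison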
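The ktup word search and the E = Np relationship can be illustrated with a small sketch. It counts identical ktup words per diagonal (query position minus database position) and reports the diagonal with the most hits, which is a simplification of how FASTA locates its best region. The function names, the example sequences and the numbers passed to `expectation` are assumptions made for illustration.

```python
from collections import Counter


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the diagonal with most ktup matches."""
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1  # same offset means same diagonal
    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


def expectation(n_sequences, p):
    """E = N * p: expected number of chance hits in a database of N sequences."""
    return n_sequences * p


hit = best_diagonal("MVLSAADKNNVKG", "AAMVLSAADKNNV", ktup=2)
e_value = expectation(1000, 0.001)
```

Here the database sequence is the query shifted by two residues, so the hits concentrate on diagonal -2. With N = 1000 and p = 0.001, one such match is expected by chance.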


    PROTEIN MATRICES

    Identity substitution matrix (rows: subject residue; columns: query residue)

    C  1
    S  0 1
    T  0 0 1
    P  0 0 0 1
    A  0 0 0 0 1
    G  0 0 0 0 0 1
    N  0 0 0 0 0 0 1
    D  0 0 0 0 0 0 0 1
    E  0 0 0 0 0 0 0 0 1
    Q  0 0 0 0 0 0 0 0 0 1
    H  0 0 0 0 0 0 0 0 0 0 1
    R  0 0 0 0 0 0 0 0 0 0 0 1
    K  0 0 0 0 0 0 0 0 0 0 0 0 1
    M  0 0 0 0 0 0 0 0 0 0 0 0 0 1
    I  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    L  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    V  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    Y  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    W  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
       C S T P A G N D E Q H R K M I L V F Y W

    PAM250 matrix (rows: subject residue; columns: query residue)

    C  12
    S   0   2
    T  -2   1   3
    P  -1   1   0   6
    A  -2   1   1   1   2
    G  -3   1   0  -1   1   5
    N  -4   1   0  -1   0   0   2
    D  -5   0   0  -1   0   1   2   4
    E  -5   0   0  -1   0   0   1   3   4
    Q  -5  -1  -1   0   0  -1   1   2   2   4
    H  -3  -1  -1   0  -1  -2   2   1   4   3   6
    R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
    K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
    M  -5  -1  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
    I  -3   0   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
    L  -6  -2  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
    V  -2   0   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
    F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
    Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
    W  -8  -5   5  -6  -6  -7   4   7   7   5   3   2  -3  -4  -5  -2  -6   0   0  17
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
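To show how the two matrices lead to different scores for the same alignment, the sketch below scores an ungapped alignment column by column. Only a handful of PAM250 entries are included, copied from the table above; the dictionary name `PAM250_SUBSET`, the function names and the three-residue example are illustrative assumptions.

```python
# A few PAM250 entries copied from the table above. The matrix is symmetric,
# so each pair is stored once; all other pairs are left out of this sketch.
PAM250_SUBSET = {
    ("C", "C"): 12,
    ("D", "D"): 4,
    ("E", "E"): 4,
    ("D", "E"): 3,
    ("I", "I"): 5,
    ("L", "L"): 6,
    ("I", "L"): 2,
}


def identity_score(a, b):
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def pam_score(a, b):
    """Look up a pair in either orientation (raises KeyError if not included)."""
    if (a, b) in PAM250_SUBSET:
        return PAM250_SUBSET[(a, b)]
    return PAM250_SUBSET[(b, a)]


def score_ungapped(x, y, pair_score):
    """Sum the per-column scores of two equal-length aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


identity_total = score_ungapped("CDL", "CEI", identity_score)
pam_total = score_ungapped("CDL", "CEI", pam_score)
```

With the identity matrix only the C/C column contributes, while PAM250 also rewards the conservative D/E and L/I substitutions.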


    GAPS AND PENALTIES

    Constant penalty: a fixed cost per gap, usually 1

    Proportional penalty: depends on the length of the gap

    Affine penalty: gap opening penalty + gap extension penalty

    S = alignment score from the matrix - gap penalty
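The three penalty schemes can be compared with a short sketch. The default values (opening penalty 10, extension penalty 1) and the convention that the opening penalty covers the first gap position are assumptions for illustration; programs differ on both.

```python
def constant_penalty(length, cost=1):
    """Same cost for any gap, regardless of its length."""
    return cost if length > 0 else 0


def proportional_penalty(length, per_position=1):
    """Cost grows linearly with the gap length."""
    return per_position * length


def affine_penalty(length, gap_open=10, gap_extend=1):
    """Opening cost for the first position, extension cost for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def alignment_score(matrix_score, gap_lengths, penalty):
    """S = score from the substitution matrix minus the total gap penalty."""
    return matrix_score - sum(penalty(n) for n in gap_lengths)
```

For an alignment with a matrix score of 20 and two gaps of lengths 3 and 1, `alignment_score(20, [3, 1], affine_penalty)` charges 12 for the first gap and 10 for the second.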


    RESTRICTION SITES
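Searching a sequence for restriction sites amounts to finding every occurrence of each enzyme's recognition sequence. The sketch below uses three well-known enzymes (EcoRI, BamHI, HindIII) as examples; the enzyme selection, the function name `find_sites` and the test sequence are assumptions, since the slide does not list them.

```python
# Recognition sequences of three common enzymes. All three are palindromic,
# so searching one strand is enough.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna, sites=SITES):
    """Return {enzyme: [0-based start positions]} for each recognition site."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start)
            start = dna.find(site, start + 1)  # also finds overlapping sites
        hits[enzyme] = positions
    return hits


hits = find_sites("ttGAATTCaaggatccGAATTC")
```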


    MULTIPLE ALIGNMENT

    More than two sequences

    Gaps are frequent

    Usually a global alignment


    WHY DO WE NEED MULTIPLE ALIGNMENT?

    Homology searching between sequences

    To characterize protein families: conserved domains, promoters, etc.

    Designing special probes, degenerate primers, etc.

    Required in protein modeling

    Helps in predicting the secondary and tertiary structure of a new sequence

    Input for constructing a phylogenetic tree


    MULTIPLE ALIGNMENT ALGORITHMS

    Hierarchical method (Clustal W)

    Divide and conquer method

    [Figure: sequences A, B, C, D and E]


    MULTIPLE ALIGNMENT (continued)

    [Figure: multiple alignment with gaps and a conserved region marked]
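Conserved regions in a multiple alignment can be located by checking each column for identical residues. The sketch below assumes the alignment is already computed and given as equal-length strings with `-` for gaps; the function names and the three short example sequences are hypothetical.

```python
def conserved_columns(alignment, gap="-"):
    """Return indices of columns where every sequence has the same residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [col
            for col, residues in enumerate(zip(*alignment))
            if len(set(residues)) == 1 and residues[0] != gap]


def marker_line(alignment):
    """Build a line with '*' under fully conserved columns, ' ' elsewhere."""
    conserved = set(conserved_columns(alignment))
    return "".join("*" if col in conserved else " "
                   for col in range(len(alignment[0])))


alignment = ["MVLSAAD",
             "MV-SAAD",
             "MVHSPAD"]
columns = conserved_columns(alignment)
markers = marker_line(alignment)
```

Runs of adjacent conserved columns in such output correspond to the conserved regions marked in an alignment viewer.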


    CONSERVED DOMAIN SEARCH

    [Figure: conserved domain search result with the conserved domain marked]

    Part of the sequence (about 20%) at the C-terminal end is missing in the BLAST result


    SOFTWARE AVAILABLE

    ClustalW / ClustalX

    BioEdit

    QAlign

    CLC Free Workbench

    GeneTool

    Vector NTI

    NCBI server

    EMBL server


    PHYLOGENETIC ANALYSIS

    Sequences should be correct and should originate from the specified source

    Sequences should be homologous

    Each position in an alignment should be homologous with every other
    position in that column of the alignment

    Sequences should not be contaminated, e.g. nuclear sequences mixed with
    organelle genome sequences


    PHYLOGENETIC ANALYSIS (continued)

    Tree building methods

    Distance methods: UPGMA, NJ

    Character-based methods: maximum parsimony method, maximum likelihood method
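As an illustration of a distance method, the sketch below computes p-distances (the proportion of differing sites) between aligned sequences and clusters them with UPGMA. It returns only the tree topology as nested tuples, without branch lengths. The function names and the four toy sequences are assumptions made for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of positions that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, dist):
    """Cluster by average distance; dist maps frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


seqs = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AAAATTTT", "D": "TAATTTTT"}
dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)}
tree = upgma(list(seqs), dist)
```

Neighbour joining uses the same kind of distance matrix but corrects for unequal rates among lineages when choosing which pair to join.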


    NEIGHBOUR JOINING METHOD

    [Figure: neighbour-joining trees for the taxa S, P, G, At, Ao, L and H]


    ORF SEARCHING

    Molecular biology background

    An ORF search relies on the following features

    Initiation codon

    Stop codon

    Intron boundaries

    Defined codon usage
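A minimal ORF search based on the initiation and stop codons can be sketched as below. It scans the three forward reading frames only and ignores introns and the reverse strand, so it is suited to intron-free sequences; the function name `find_orfs` and the `min_codons` threshold are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the ATG, end is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame initiation codon
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "ccATGGTGCATCTGTAAgg"
orfs = find_orfs(sequence)
```

In the example, the only ORF starts at the ATG in the third reading frame and ends at the TAA stop codon.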


    ORF FINDING ALGORITHMS

    Content-based method

    Site-based method

    Comparative method
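Content-based methods rely on statistical properties of coding regions such as codon usage. As a small illustration, the sketch below computes relative codon frequencies for an in-frame coding sequence; the function name `codon_usage` and the short example sequence are assumptions, and a real gene finder would compare such frequencies against an organism-specific reference table.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: relative frequency} for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


usage = codon_usage("ATGCTGCTGCTCTAA")
```

Here leucine is encoded twice by CTG and once by CTC, which is the kind of bias that differs from organism to organism.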


    ORF FINDING ALGORITHMS (continued)

    [Figure: ORF finder output for the beta-hemoglobin gene, shown as text information and as a graphical view]


    SOFTWARE AVAILABLE

    GENSCAN

    GeneTool

    CLC Free Workbench


    PROTEIN THREE DIMENSIONAL MODELING

    Comparative modeling

    Fold recognition

    Ab initio prediction


    COMPARATIVE PROTEIN MODELING

    Start

    Identify related structures

    Select a template

    Align the target sequence with the template structure

    Build a model for the target

    Evaluate the model

    Model OK?

    YES: end

    NO: go back and repeat the earlier steps


    COMPARATIVE PROTEIN MODELING (continued)

    [Figure: bovine hemoglobin and human hemoglobin beta chain]


    SOFTWARE AVAILABLE

    Cn3D

    BioEdit

    DeepView / Swiss-PdbViewer


    OTHER METHODS

    Fold recognition

    Also called protein threading

    It uses a library of known models (folds)

    The model is constructed from the information in this library

    Ab initio prediction

    Uses thermodynamics and quantum mechanics