genomics file

  • 8/7/2019 genomics file

    1/43

    GENOMICS FILE

    SUBMITTED TO: MS INDU GAUR

    SUBMITTED BY: K. PUNIT PUSHKAR, IMT/07/8037, SECTION S

    EXPERIMENT NO.1

    AIM: To study different websites and databases related to genomic research

    NCBI- The National Center for Biotechnology Information (NCBI) is part of the United States
    National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is
    located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator
    Claude Pepper. The NCBI houses genome sequencing data in GenBank and an index of biomedical
    research articles in PubMed Central and PubMed, as well as other information relevant to
    biotechnology. All these databases are available online through the Entrez search engine. The
    Entrez Global Query Cross-Database Search System is a federated search engine, or web portal,
    that allows users to search many discrete health sciences databases at the NCBI website. The
    NCBI has been responsible for making the GenBank DNA sequence database available since 1992.
    GenBank coordinates with individual laboratories and with other sequence databases, such as
    those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan
    (DDBJ); together with Europe's EMBL and Japan's DDBJ, GenBank forms the International
    Nucleotide Sequence Database Collaboration. The two major roles of the NCBI are to conduct
    research in computational biology and to create public databases.

    1. GenBank- The GenBank sequence database is an open access, annotated collection of all
    publicly available nucleotide sequences and their protein translations. This database is
    produced and maintained by the National Center for Biotechnology Information (NCBI) as part
    of the International Nucleotide Sequence Database Collaboration (INSDC). Direct submissions
    are made to GenBank using BankIt, which is a web-based form, or the stand-alone submission
    program, Sequin. Upon receipt of a sequence submission, the GenBank staff assigns an
    accession number to the sequence and performs quality assurance checks. The submissions are
    then released to the public database, where the entries are retrievable by Entrez or
    downloadable by FTP.

    SEQUENCE SUBMISSION TOOLS include BankIt and Sequin. BankIt, a web-based submission tool, is
    used for a single sequence, a simple set of sequences, or a small batch of different
    sequences. Sequin is a stand-alone software tool developed by the NCBI for submitting and
    updating entries in the GenBank, EMBL, or DDBJ sequence databases. It can handle simple
    submissions that contain a single short mRNA sequence as well as complex submissions
    containing long sequences, multiple annotations, segmented sets of DNA, or phylogenetic and
    population studies.

    2. EMBL- The European Molecular Biology Laboratory (EMBL) is a molecular biology research
    institution supported by 20 European countries, with Australia as an associate member state.
    EMBL was created in 1974 and is an intergovernmental organisation funded by public research
    money from its member states. EMBL's mission has several cornerstones: performing basic
    research in molecular biology and molecular medicine; training scientists, students and
    visitors at all levels; offering vital services to scientists in the member states; developing
    new instruments and methods in the life sciences; and engaging actively in technology
    transfer. The European Bioinformatics Institute (EBI), an outstation of EMBL, maintains the
    EMBL nucleotide sequence database.

    3. DDBJ- The DNA Data Bank of Japan (DDBJ) is a DNA data bank located at the National
    Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the
    International Nucleotide Sequence Database Collaboration. It exchanges its data daily with
    the European Molecular Biology Laboratory at the European Bioinformatics Institute and with
    GenBank at the National Center for Biotechnology Information. Thus these three databanks
    contain the same data at any given time.
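
    Retrieval of a GenBank entry through Entrez can also be scripted. The sketch below only
    builds the request URL for NCBI's public E-utilities "efetch" service and does not contact
    the server. The function name genbank_fetch_url is made up for this illustration, and the
    accession U49845 is used purely as an example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession: str) -> str:
    """Build an efetch URL for one nucleotide record in GenBank flat-file format.

    Hypothetical helper for illustration; it does not perform the download.
    """
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number assigned on submission
        "rettype": "gb",    # GenBank flat-file record
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = genbank_fetch_url("U49845")
print(url)
```

    The printed URL can be opened in a browser or passed to any HTTP client to download the
    record.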

    TYPES OF DATABASES:

    NUCLEOTIDE DATABASES

    dbEST is a division of GenBank, established in 1992, that holds expressed sequence tag (EST)
    data. As with GenBank, data in dbEST are submitted directly by laboratories worldwide and
    are not curated.

    dbSTS is an NCBI resource that contains sequence data for short genomic landmark sequences,
    or Sequence Tagged Sites (STS). STS sequences are incorporated into the STS Division of
    GenBank. The dbSTS database offers a route for submission of STS sequences to GenBank and is
    designed especially for the submission of large batches of STS sequences.

    The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic

    variation within and across different species developed and hosted by the National Center for

    Biotechnology Information (NCBI) in collaboration with the National Human Genome Research

    Institute (NHGRI). Although the name of the database implies a collection of one class of

    polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of

    molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs),

    (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms

    (MNPs), (5) heterozygous sequences, and (6) named variants.

    The Reference Sequence (RefSeq) database is an open access, annotated and curated collection
    of publicly available nucleotide sequences (DNA, RNA) and their protein products. This
    database is built by the NCBI and, unlike GenBank, provides only a single record for each
    natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from
    viruses to bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate
    and linked records for the genomic DNA, the gene transcripts, and the proteins arising from
    those transcripts. RefSeq is limited to major organisms for which sufficient data are
    available.

    PROTEIN DATABASES

    The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological
    molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray
    crystallography or NMR spectroscopy and submitted by biologists and biochemists from around
    the world, are freely accessible on the Internet via the websites of its member organisations
    (PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein
    Data Bank (wwPDB).

    The Protein Clusters database provides easy access to annotation information, publications,

    domains, structures, and external links and analysis tools including multiple alignments,

    phylogenetic trees, and genomic neighborhoods (ProtMap). Protein Clusters can be searched like

    any other Entrez database.

    STRUCTURAL DATABASES

    The Conserved Domain Database (CDD) brings together several collections of multiple

    sequence alignments representing conserved domains, including NCBI-curated domains, which

    use 3D-structure information to explicitly define domain boundaries and provide insights into

    sequence/structure/function relationships, as well as domain models imported from a number

    of external source databases. The data are then used for putative functional annotation of protein

    query sequences based on matches to specific hits.

    The Structural Classification of RNA (SCOR) database provides a survey of the three-

    dimensional motifs contained in 259 NMR and X-ray RNA structures. In one classification, the

    structures are grouped according to function. The RNA motifs, including internal and external

    loops, are also organized in a hierarchical classification. The 259 database entries contain 223

    internal and 203 external loops; 52 entries consist of fully complementary duplexes.

    GENOME DATABASES

    The NCBI Entrez Genome database is a collection of complete large-scale sequencing,
    assembly, annotation, and mapping projects for cellular organisms. It contains genomic
    sequences at different stages of finishing from both the public-domain sequencing effort and
    Celera Genomics, together with protein function data and gene structure. It helps in
    understanding the genomic organization of genes, mapping a gene, understanding the
    exon/intron structure of a gene, searching for genetic and physical markers, and accessing
    comprehensive information about a gene, its transcript(s) and protein(s), structure,
    activity, and location.

    CHEMICAL DATABASES

    PubChem is a database of chemical molecules and their activities against biological assays.

    The system is maintained by the NCBI, a component of the National Library of Medicine,

    which is part of the United States National Institutes of Health (NIH). PubChem can be

    accessed for free through a web user interface.

    Chemical Entities of Biological Interest, also known as ChEBI, is a database and ontology of

    molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical

    Ontologies effort. The term "molecular entity" refers to any "constitutionally or isotopically

    distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc.,

    identifiable as a separately distinguishable entity". ChEBI uses nomenclature, symbolism and

    terminology endorsed by the International Union of Pure and Applied Chemistry (IUPAC) and

    Nomenclature Committee of the International Union of Biochemistry and Molecular Biology.

    METABOLIC OR PATHWAY DATABASES

    KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases

    dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY

    database records networks of molecular interactions in the cells, and variants of them specific

    to particular organisms.

    LITERATURE DATABASES

    PubMed is a free database that primarily accesses the MEDLINE database of references and
    abstracts on life sciences and biomedical topics. In addition to MEDLINE, it also provides
    access to OLDMEDLINE for pre-1966 records and citations to articles from MEDLINE journals.
    Citations may include links to full-text content from PubMed Central and publisher web
    sites.

    MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic
    database of life sciences and biomedical information. It includes bibliographic information
    for articles from academic journals covering medicine, nursing, pharmacy, dentistry,
    veterinary medicine, and health care. MEDLINE also covers much of the literature in biology
    and biochemistry, as well as fields such as molecular evolution. Compiled by the United
    States National Library of Medicine (NLM), MEDLINE is freely available on the Internet and
    searchable via PubMed and the Entrez system of NLM's National Center for Biotechnology
    Information.

    DISEASE DATABASES

    OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic

    phenotypes. The full-text, referenced overviews in OMIM contain information on all known

    mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between

    phenotype and genotype. It is updated daily, and the entries contain copious links to other

    genetics resources.

    EXPERIMENT NO.2

    AIM: To study different tools used for genomic research

    Phred: Better Base Calling

    Phred is a base-calling program for DNA sequence traces. The program was developed by Drs.
    Phil Green and Brent Ewing, and is copyrighted by the University of Washington. It is widely
    used by the largest academic and commercial sequencing laboratories. Phred has a high
    base-calling accuracy, with error rates reported to be 40-50% lower than those of the
    base-calling software it was compared with. The accurate error probabilities that Phred
    calculates for each base enable increased automation of the sequencing process: for example,
    drastically lower false-positive error rates in mutation detection, effective quality
    control immediately after sequence production, and quantitative benchmarking of different
    sequencing methods and protocol changes. Phred was developed for the Human Genome Project,
    where large amounts of sequence data were processed by automated scripts; therefore, Phred's
    processing options are set by command-line parameters. For Windows and OS X users who prefer
    a graphical user interface, CodonCode Corporation offers the sequence analysis software
    CodonCode Aligner. CodonCode Aligner greatly simplifies using Phred for base calling and
    Phrap for sequence assembly, and also offers a number of additional functions often needed in
    DNA sequencing projects, for example contig alignment and editing, reference sequence
    alignments, and mutation detection.
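
    The per-base error probabilities mentioned above are usually reported as Phred quality
    scores, Q = -10 log10(P), where P is the estimated probability that the base call is wrong.
    A minimal sketch of the conversion follows; the function names are chosen for this
    illustration and are not part of Phred itself.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a Phred quality score Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q back into an error probability P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of wrong base calls in a read: the sum of per-base error probabilities."""
    return sum(phred_to_error_prob(q) for q in qualities)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

    A base with Q = 20 therefore has an estimated 1 in 100 chance of being wrong, and a base
    with Q = 30 a 1 in 1000 chance.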

    Phrap: Better Sequence Assemblies

    Phrap is a leading program for DNA sequence assembly. Phrap is routinely used in some of the
    largest sequencing projects in the Human Genome Sequencing Project and in the biotech
    industry. Some of Phrap's features include:

    Fast assemblies- Assemblies of cosmid- to BAC sized projects with several hundred to two

    thousand reads typically take only minutes to complete on high-powered workstations or

    personal computers.

    Accurate consensus sequences- Phrap uses Phred's quality scores to determine highly accurate

    consensus sequences. Phrap examines all individual sequences at a given position, and generally

    uses the highest quality sequence to build consensus. Compared to simple majority rules used in

    older sequence assembly programs, Phrap's approach can give significantly more accurate

    consensus sequences.

    Consensus quality estimates- Phrap uses the quality information of individual sequences to

    estimate the quality of the consensus sequence. In addition, Phrap uses available information

    about sequencing chemistry (dye terminator or dye primer) and confirmation by "other strand"

    reads in estimating the consensus quality.

    Ability to assemble very large projects- Phrap has been used routinely to assemble bacterial

    genomes sequenced by the "shotgun" approach, where each project contained tens of thousands

    of reads. Smaller bacterial genomes (2 million bases or less) could often be assembled in less

    than three hours.

    Improved identification and handling of repeats- Phrap uses quality scores to estimate whether

    discrepancies between two overlapping sequences are more likely to arise from random errors, or

    from different copies of a repeated sequence. For repeats with 95 to 98% identity (like human

    Alu sequences) and high quality sequence data, this typically yields correct assemblies.
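
    The difference between a majority-rule consensus and a quality-based consensus can be shown
    on a single alignment column. The sketch below is a toy illustration of the idea described
    above, not Phrap's actual algorithm; the data and function names are invented for the
    example.

```python
from collections import Counter


def majority_consensus(column):
    """Pick the base that occurs most often in the column, ignoring quality."""
    counts = Counter(base for base, _quality in column)
    return counts.most_common(1)[0][0]


def highest_quality_consensus(column):
    """Pick the base from the read with the highest quality score at this position."""
    best_base, _best_quality = max(column, key=lambda pair: pair[1])
    return best_base


# One alignment column: (base call, Phred quality) from three overlapping reads.
column = [("A", 8), ("A", 10), ("G", 45)]

print(majority_consensus(column))         # two low-quality reads outvote the good one
print(highest_quality_consensus(column))  # the single high-quality read decides
```

    Here two low-quality reads agree on A while one high-quality read calls G, so the two
    rules give different answers.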

    Cross_match: Fast DNA Sequence Comparisons and Vector Screening

    Cross_match is a program for fast comparisons of DNA sequences that uses the same algorithms
    as Phrap. For example, the comparison of several hundred thousand bases of "raw" sequence to
    the sequence of an entire BAC typically takes less than one minute. Within the Phred-Phrap
    system, Cross_match is typically used for vector screening. It is also used for the
    identification of overlaps between contig ends after assembly with Phrap, the identification
    of potential repeat sequences in assemblies, the generation of error summaries and lists
    after completion of sequencing projects, and the estimation of vector contamination in newly
    created libraries.

    Fgenesh

    Fgenesh is an ab initio gene prediction program. It is an HMM-based program with parameters
    for finding genes in humans, Drosophila, plants, yeast and nematodes. The program predicts
    some genes that are not annotated as genes and fails to predict some genes that do exist. A
    newer program, Fgenesh+, recovers a set of the missed genes when information about
    homologous protein sequences is supplied. It performs better in terms of sensitivity and
    specificity. This suggests that, while ignoring similarity to cDNAs, ESTs, and protein
    sequences may be appropriate for analyzing the ab initio part of a predictor algorithm, in
    real-life scenarios of predicting genes in newly sequenced eukaryotic genomes more genes can
    be predicted by including these database sequences.
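
    Sensitivity and specificity of a gene predictor can be computed by comparing predicted
    exons with annotated ones. The sketch below assumes the convention common in gene-finding
    evaluations, where sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP); the exon
    coordinates are invented for the example and the function name is hypothetical.

```python
def exon_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) tuples; an exon counts as correct only if both ends match.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)    # predicted and annotated
    false_pos = len(predicted - annotated)   # predicted but not annotated
    false_neg = len(annotated - predicted)   # annotated but missed
    sensitivity = true_pos / (true_pos + false_neg) if annotated else 0.0
    specificity = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, specificity


annotated_exons = [(100, 200), (300, 400), (500, 650)]
predicted_exons = [(100, 200), (300, 400), (700, 800), (900, 950)]

sn, sp = exon_level_accuracy(predicted_exons, annotated_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

    In this example two of the three annotated exons are found and two of the four predictions
    are wrong.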

    Glimmer

    Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria,

    archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses

    interpolated Markov models (IMMs) to identify the coding regions and distinguish them from

    noncoding DNA. The IMM approach uses a combination of Markov models from 1st through

    8th-order, weighting each model according to its predictive power. Glimmer uses 3-periodic

    nonhomogenous Markov models in its IMMs.

    Glimmer was the primary microbial gene finder used at The Institute for Genomic Research
    (TIGR), where it was first developed, and has been used to annotate the complete genomes of
    over 100 bacterial species from TIGR and other labs. Glimmer served as the basis for the
    design of GlimmerM, which includes an algorithm for predicting splice sites. Further
    improvements to GlimmerM for eukaryotic gene prediction resulted in GlimmerHMM, which also
    adds splice site predictors adapted from the GeneSplicer program.

    GRAIL

    The goal of the GRAIL program is to combine several algorithms, each detecting a different
    feature of a protein-coding gene, to predict with high accuracy the position of a gene within
    a genome. Originally, GRAIL examined the presence of these features in a sliding
    99-nucleotide window; however, this biased the program towards prediction of longer exons, so
    that it missed a larger number of shorter exons. This bias was later removed by allowing the
    program to examine all possible exons, rather than just those in a sliding window. In both
    cases, GRAIL uses a neural network to combine the predictions for all these gene features.

    GRAIL starts by scoring a region as protein coding versus noncoding based on the frequency of
    6-mers that occur often in coding versus noncoding sequences. These coding regions are then
    scored for the presence of a start codon with a stop codon downstream and in frame; the
    presence of these features gives a higher score. The GRAIL algorithm can also identify
    frameshift mutations (insertions or deletions) that may be introduced due to errors during
    sequencing, by determining when a shift in frame occurs in a region with high coding
    potential, creating an out-of-frame stop codon. Splice sites are also detected, by analysis
    of the coding region surrounding splice donor and splice acceptor sequences. GRAIL also
    scores CpG islands (CpG dinucleotides are underrepresented in the genome overall but enriched
    just 5' of coding regions), the presence of a TATA box, and the polyadenylation signal.
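
    The "start codon with an in-frame stop codon downstream" feature can be sketched as a
    simple scan of the three forward reading frames. This is only an illustration of that one
    feature, not GRAIL's scoring scheme; the function name, the min_codons parameter and the
    test sequence are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    start is the 0-based index of the A in ATG; end is one past the stop codon.
    min_codons is the minimum number of codons before the stop (illustrative cutoff).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

    A real gene finder would also scan the reverse strand and combine this signal with the
    other features listed above.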

    EXPERIMENT NO.3

    AIM: To Study DNA sequencing methods

    DNA sequencing is the determination of the order of the nucleotide bases (adenine, guanine,
    cytosine, and thymine) in a molecule of DNA. Knowledge of DNA sequences has become
    indispensable for basic biological research, for other research branches that use DNA
    sequencing, and in numerous applied fields such as diagnostics, biotechnology, forensic
    biology and biological systematics. The advent of DNA sequencing has significantly
    accelerated biological research and discovery. The rapid speed of sequencing attained with
    modern DNA sequencing technology was instrumental in the sequencing of the human genome in
    the Human Genome Project.

    Maxam-Gilbert sequencing

    In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on
    chemical modification of DNA and subsequent cleavage at specific sites. The method requires
    radioactive labeling at one 5' end of the DNA (typically by a kinase reaction using
    gamma-32P ATP) and purification of the DNA fragment to be sequenced. Chemical treatment
    generates breaks at a small proportion of one or two of the four nucleotide bases in each of
    four reactions (G, A+G, C, C+T). For example, the purines (A+G) are depurinated using formic
    acid, the guanines (and to some extent the adenines) are methylated by dimethyl sulfate, and
    the pyrimidines (C+T) are hydrolysed using hydrazine. The addition of salt (sodium chloride)
    to the hydrazine reaction inhibits the reaction of thymine, giving the C-only reaction. The
    modified DNAs are then cleaved by hot piperidine at the position of the modified base. The
    concentration of the modifying chemicals is controlled to introduce on average one
    modification per DNA molecule. Thus a series of labeled fragments is generated, from the
    radiolabeled end to the first "cut" site in each molecule. The fragments in the four
    reactions are electrophoresed side by side in a gel for size separation.

    In chain-termination (Sanger) sequencing, the DNA bands are visualized by autoradiography or
    UV light, and the DNA sequence can be read directly off the X-ray film or gel image. When
    X-ray film is exposed to the gel, the dark bands correspond to DNA fragments of different
    lengths. A dark band in a lane indicates a DNA fragment that results from chain termination
    after incorporation of a dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP). The relative
    positions of the different bands among the four lanes are then used to read the DNA sequence
    from bottom to top.
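
    Reading a four-lane gel from bottom to top amounts to sorting the bands by fragment length
    and noting which lane each band is in, because the shortest fragments migrate furthest. A
    minimal sketch with invented band data (the function name read_gel is hypothetical):

```python
def read_gel(lanes):
    """Read a sequence from a four-lane gel.

    lanes maps each base (the dideoxynucleotide used in that lane) to the lengths of the
    fragments seen as bands in that lane. Shorter fragments run further, so reading from
    the bottom of the gel upwards means reading in order of increasing length.
    """
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    bands.sort()  # bottom of the gel (shortest fragment) first
    return "".join(base for _length, base in bands)


# Fragment lengths (relative to the primer) observed in each lane; made-up example data.
lanes = {"A": [1, 5], "C": [3], "G": [2, 6], "T": [4]}

print(read_gel(lanes))
```

    The result is the sequence of the newly synthesized strand, read 5' to 3'.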

    Automated DNA sequencing

    Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA samples in
    a single batch (run), in up to 24 runs a day. DNA sequencers carry out capillary
    electrophoresis for size separation, detection and recording of dye fluorescence, and data
    output as fluorescent peak trace chromatograms. Sequencing reactions by thermocycling,
    cleanup and re-suspension in a buffer solution before loading onto the sequencer are
    performed separately. A number of commercial and non-commercial software packages can trim
    low-quality DNA traces automatically. These programs score the quality of each peak and
    remove low-quality base peaks (generally located at the ends of the sequence).
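
    End trimming of this kind can be sketched as follows. This is a deliberately simple
    version that strips bases below a quality threshold from both ends and leaves the interior
    untouched; real trimming programs typically use more elaborate (for example windowed)
    criteria. The threshold of 20 and the function name are assumptions made for the example.

```python
def trim_low_quality_ends(bases, qualities, min_quality=20):
    """Remove leading and trailing bases whose quality score is below min_quality."""
    start, end = 0, len(bases)
    while start < end and qualities[start] < min_quality:
        start += 1          # drop low-quality bases at the 5' end
    while end > start and qualities[end - 1] < min_quality:
        end -= 1            # drop low-quality bases at the 3' end
    return bases[start:end], qualities[start:end]


read = "NNACGTACGNN"
quals = [5, 8, 30, 35, 40, 38, 12, 33, 31, 9, 4]

trimmed_bases, trimmed_quals = trim_low_quality_ends(read, quals)
print(trimmed_bases, trimmed_quals)
```

    Note that the single low-quality base in the middle of the read is kept; only the ends are
    trimmed.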

    EXPERIMENT NO.4

    AIM: To visualize the macromolecular structure of proteins using RASMOL

    RasMol is a molecular graphics program used primarily for the depiction and exploration of
    biological macromolecule structures, such as those found in the Protein Data Bank. It was
    originally developed by Roger Sayle in the early 1990s. Maintenance of RasMol, much of the
    development, and integration of modifications provided by the community are done at the
    ARCiB laboratory at Dowling College. Work on RasMol has been supported in part by grants
    from the U.S. Department of Energy, the U.S. National Science Foundation and the U.S. NIH
    National Institute of General Medical Sciences. RasMol 2.7.5 runs on a wide range of
    architectures and operating systems including Microsoft Windows, Apple Macintosh, UNIX and
    VMS systems. UNIX and VMS versions require an 8, 24 or 32 bit colour X Windows display
    (X11R4 or later). The X Windows version of RasMol [rasmol2.7.5.exe] provides optional
    support for a hardware dials box and accelerated shared memory communication (via the XInput
    and MIT-SHM extensions) if available on the current X Server.

    The program reads in a molecule coordinate file and interactively displays the molecule on
    the screen in a variety of colour schemes and molecule representations. Currently available
    representations include depth-cued wireframes, `Dreiding` sticks, spacefilling (CPK)
    spheres, ball and stick, solid and strand molecular ribbons, atom labels and dot surfaces.
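
    The coordinate files that RasMol reads are typically in PDB format, in which each ATOM
    record stores its fields in fixed columns. The sketch below reads atom names and x, y, z
    coordinates from such records; it is a minimal illustration, not RasMol's own parser, and
    the two sample lines use made-up coordinates rather than data from a real entry.

```python
def parse_atoms(pdb_text):
    """Extract a few fixed-column fields from ATOM/HETATM records of a PDB file."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # atom name, columns 13-16
                "res_name": line[17:20].strip(),  # residue name, columns 18-20
                "chain": line[21],                # chain identifier, column 22
                "res_seq": int(line[22:26]),      # residue number, columns 23-26
                "x": float(line[30:38]),          # coordinates in angstroms,
                "y": float(line[38:46]),          # columns 31-38, 39-46, 47-54
                "z": float(line[46:54]),
            })
    return atoms


# Two illustrative records with invented coordinates.
SAMPLE = (
    "ATOM      1  N   VAL A   1       6.204  16.869   4.854\n"
    "ATOM      2  CA  VAL A   1       6.913  17.759   4.607\n"
)

atoms = parse_atoms(SAMPLE)
for atom in atoms:
    print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

    The same function can be applied to the text of a downloaded structure file, for example
    parse_atoms(open("structure.pdb").read()), where the file name is a placeholder.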

    PROCEDURE FOR VISUALISATION:

    Search Google for the RasMol 2.7.5 README.

    Download the RasMol 2.7.5 Windows installer and save it.

    Open the PDB website (www.pdb.org) and search by PDB ID or by text for the complete
    structure file of the protein of interest (haemoglobin in this case).

    Download the file entitled "Structure of human deoxy hemoglobin A in complex with xenon".

    Open the structure file in RasMol and analyze it with the functions available in RasMol.

    EXPERIMENT NO.5

    AIM: To perform gene structure prediction using Genscan

    Genscan is a bioinformatics program. Its main function is to take a DNA sequence and find
    the open reading frames (ORFs) that correspond to genes. Genscan was developed by Chris
    Burge as part of his doctoral work at Stanford University. The program is used to predict
    genes in a sequenced stretch of DNA, and its model parameters are chosen according to the
    percentage of C+G content of the input sequence. Two general approaches are used for gene
    prediction:

    Statistical pattern identification- this approach to gene prediction uses general knowledge
    about gene structure. Knowledge of gene structure includes the promoter region, the start
    and end sequences of introns and exons, etc.

    Sequence similarity comparison- this approach is based on similarity and takes advantage of
    the fact that if a sequence is similar to the one with which it is being compared, it is
    likely to have the same function. However, the structure of a gene cannot be predicted
    accurately from sequence information alone.

    For large-scale analysis of genes, the typical strategy is to inactivate each gene
    completely or to overexpress it. In each case, however, the resulting phenotype may not be
    informative. Genscan uses two types of signal models to model different functional units. A
    weight matrix model is used for promoters, polyadenylation signals, and transcription
    initiation and termination signals. A modified version of the weighted array model is used
    for acceptor splice sites. After the gene structure has been predicted, the gene's function
    and expression level can be investigated.
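
    A weight matrix model scores each position of a candidate signal independently. The sketch
    below builds a log-odds weight matrix from a few aligned example sites and scans a sequence
    for the best match. It is a toy illustration: the training sites, the pseudocount and the
    uniform background of 0.25 are assumptions made here and are not Genscan's actual
    parameters.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix (one dict per position) from aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, candidate):
    """Sum the position-specific log-odds scores of a candidate site."""
    return sum(matrix[i][base] for i, base in enumerate(candidate))


def best_site(matrix, seq):
    """Return (position, score) of the highest-scoring window in seq."""
    width = len(matrix)
    windows = [(score(matrix, seq[i:i + width]), i) for i in range(len(seq) - width + 1)]
    best_score, best_pos = max(windows, key=lambda item: item[0])
    return best_pos, best_score


# Toy training set resembling polyadenylation signals.
training_sites = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA"]
matrix = build_weight_matrix(training_sites)

print(best_site(matrix, "GGCCAATAAAGGCC"))
```

    A weighted array model extends this idea by letting the score at each position depend on
    the preceding base as well.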

    PROCEDURE:

    Search for Genscan on Google and select genes.mit.edu/GENSCAN.html.

    Go to the NCBI homepage and search for chromosome 14 under the genome databases option.

    Select the entire sequence or a part of it and paste it into the input box on the Genscan
    homepage.

    Fill in the entries according to the requirements of the experiment and click on RUN.

    1) GENSCAN 1.0 ru31-1

    EXPERIMENT NO.6

    AIM: To perform multiple sequence alignment using the CLUSTALW algorithm

    The sensitivity of the commonly used progressive multiple sequence alignment method has been
    greatly improved for the alignment of divergent protein sequences. Firstly, individual
    weights are assigned to each sequence in a partial alignment in order to downweight
    near-duplicate sequences and upweight the most divergent ones. Secondly, amino acid
    substitution matrices are varied at different alignment stages according to the divergence
    of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced
    gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather
    than regular secondary structure. Fourthly, positions in early alignments where gaps have
    been opened receive locally reduced gap penalties to encourage the opening up of new gaps at
    these positions. These modifications are incorporated into a new program, CLUSTALW.

    ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins. The
    basic multiple alignment algorithm consists of three main stages: 1) all pairs of sequences
    are aligned separately in order to calculate a distance matrix giving the divergence of each
    pair of sequences; 2) a guide tree is calculated from the distance matrix; 3) the sequences
    are progressively aligned according to the branching order in the guide tree.
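
    The first stage, and the first decision of the second, can be sketched in a few lines. The
    code below computes a pairwise distance for every pair of sequences and picks the closest
    pair, which is the pair a guide tree would join first. As a stand-in for ClustalW's scored
    pairwise alignments it uses a unit-cost edit distance (a global alignment in which every
    mismatch or gap costs 1), so the numbers are not those ClustalW would report. All names and
    sequences are invented for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost global alignment distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                      # gap in b
                curr[j - 1] + 1,                  # gap in a
                prev[j - 1] + (char_a != char_b), # match or mismatch
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Stage 1: divergence of every pair, normalised by the longer sequence length."""
    return {
        (x, y): edit_distance(seqs[x], seqs[y]) / max(len(seqs[x]), len(seqs[y]))
        for x, y in combinations(seqs, 2)
    }


def first_join(distances):
    """First step of stage 2: the least divergent pair is joined first in the guide tree."""
    return min(distances, key=distances.get)


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTCCGA"}
distances = distance_matrix(seqs)
print(distances)
print(first_join(distances))
```

    In the third stage the sequences would be aligned in the order given by the complete guide
    tree, starting with the pair identified here.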

    PROCEDURE: Search for CLUSTALW on Google, select the required page and follow the steps:

    Step 1 - Sequence

    Sequence Input Window

    Three or more sequences to be aligned can be entered directly into this form. Sequences can be

    in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot format.

    Partially formatted sequences are not accepted. Adding a return to the end of the sequence may

    help certain applications understand the input. Note that directly using data from word processors

    may yield unpredictable results as hidden/control characters may be present.

    Sequence File Upload

    A file containing three or more valid sequences in any format (GCG, FASTA, EMBL, GenBank,

    PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot) can be uploaded and used as input for the

    multiple sequence alignment. Word processor files may yield unpredictable results as

    hidden/control characters may be present in the files. It is best to save files with the Unix format

    option to avoid hidden Windows characters.

    Sequence Type

    Indicates if the sequences to align are protein or nucleotide (DNA/RNA).

    Type       Abbreviation
    Protein    protein
    DNA        dna

    Default value is: Protein [protein]

    Step 2 - Pairwise Alignment Options

    Alignment Type

    The alignment method used to perform the pairwise alignments used to generate the guide tree.

    Alignment Type    Description              Abbreviation

    slow              Slow, but accurate       slow

    fast              Fast, but approximate    fast

    Default value is: slow

    Protein Weight Matrix (PW)

    Slow pairwise alignment protein sequence comparison matrix series used to score alignment.

    Matrix (Protein Only)    Abbreviation

    BLOSUM                   blosum

    PAM                      pam

    Gonnet                   gonnet

    ID                       id

    Default value is: Gonnet [gonnet]

    DNA Weight Matrix (PW)

    Slow pairwise alignment nucleotide sequence comparison matrix used to score alignment.

    Matrix (DNA Only)    Abbreviation

    IUB                  iub

    ClustalW             clustalw

    Default value is: IUB [iub]

    Gap Open (PW)

    Slow pairwise alignment score for the first residue in a gap.

    Default value is: 10

    Gap Extension (PW)

    Slow pairwise alignment score for each additional residue in a gap.

    Default value is: 0.1
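    Taken together, the gap open and gap extension values describe an affine gap cost. A
    minimal sketch, assuming the reading given above (the first residue of a gap costs the
    gap open value, each additional residue costs the gap extension value); the function
    name is hypothetical and the exact bookkeeping inside ClustalW may differ.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.1):
    """Total penalty for one gap of the given length under an affine scheme.

    Defaults are the slow pairwise values listed above (10 and 0.1).
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 5, 50):
    print(n, round(affine_gap_penalty(n), 2))
```

    With these defaults a single long gap is much cheaper than many short ones, which is
    why opening a gap is penalised far more heavily than extending it.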

    KTUP

    Fast pairwise alignment word size used to find matches between the sequences. Decrease for

    sensitivity; increase for speed.

    Default value is: 1
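    The effect of the word size can be illustrated with a small word-matching routine. This
    is a simplified sketch, not the ClustalW implementation; the function name and the toy
    sequences are hypothetical.

```python
from collections import defaultdict


def ktuple_matches(seq_a, seq_b, ktup=1):
    """Return (i, j) start positions where a word of length ktup occurs in both sequences."""
    index = defaultdict(list)
    for i in range(len(seq_a) - ktup + 1):
        index[seq_a[i:i + ktup]].append(i)   # every word of seq_a and where it starts
    hits = []
    for j in range(len(seq_b) - ktup + 1):
        for i in index.get(seq_b[j:j + ktup], []):
            hits.append((i, j))
    return hits


a, b = "ACGTAC", "ACGGAC"
for k in (1, 2, 3):
    print(k, len(ktuple_matches(a, b, ktup=k)))
```

    Smaller words give more candidate matches to examine (more sensitive, slower); larger
    words give fewer (faster, less sensitive).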

    Window Length

    Fast pairwise alignment window size for joining word matches. Decrease for speed; increase for

    sensitivity.

    Default value is: 5

    Score Type

    Fast pairwise alignment score type to output.


    Score Type    Description

    percent       Percent

    absolute      Absolute

    Default value is: percent

    Top Diags

    Number of match regions used to create the fast pairwise alignment.

    Decrease for speed; increase for sensitivity.

    Default value is: 5

    Pair Gap

    Fast pairwise alignment gap penalty for each gap created.

    Default value is: 3

    Step 3 - Multiple Sequence Alignment Options

    Protein Weight Matrix

    Multiple alignment protein sequence comparison matrix series used to score the alignment.

    Matrix (Protein Only)    Abbreviation

    BLOSUM                   blosum

    PAM                      pam

    Gonnet                   gonnet

    ID                       id


    Default value is: Gonnet [gonnet]

    DNA Weight Matrix

    Multiple alignment nucleotide sequence comparison matrix used to score the alignment.

    Matrix (DNA Only)    Abbreviation

    IUB                  iub

    ClustalW             clustalw

    Default value is: IUB [iub]

    Gap Open

    Multiple alignment penalty for the first residue in a gap.

    Default value is: 10

    Gap Extension

    Multiple alignment penalty for each additional residue in a gap.

    Default value is: 0.20

    Gap Distances

    Multiple alignment gaps that are closer together than this distance are penalised.

    Default value is: 5

    No End Gaps

    Disables the gap separation penalty when scoring gaps at the ends of the

    alignment.


    No End Gaps    Abbreviation

    no             false

    yes            true

    Default value is: no [false]

    Iteration

    Multiple alignment improvement iteration type

    Iteration    Description                                    Abbreviation

    none         No iteration                                   none

    tree         Iteration at each step of alignment process    tree

    alignment    Iteration only on final alignment              alignment

    Default value is: none

    Num Iter

    Maximum number of iterations to perform

    Default value is: 1

    Clustering

    Clustering type.

    Clustering    Description                                Abbreviation

    NJ            Neighbour-joining (Saitou and Nei 1987)    NJ

    UPGMA         UPGMA clustering                           UPGMA

    Default value is: NJ
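    To show how a clustering method turns a distance matrix into a guide tree, here is a
    minimal UPGMA sketch (the simpler of the two options; neighbour-joining is not shown).
    The function name, the nested-tuple tree representation and the toy distances are all
    assumptions made for illustration.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples from pairwise distances.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, list of leaf names inside it)
    clusters = [(name, [name]) for name in names]

    def average_distance(leaves_a, leaves_b):
        # UPGMA: unweighted mean over all leaf pairs between the two clusters
        total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical distances between three sequences
toy_dist = {
    frozenset(("seqA", "seqB")): 0.30,
    frozenset(("seqA", "seqC")): 0.55,
    frozenset(("seqB", "seqC")): 0.60,
}
tree = upgma(["seqA", "seqB", "seqC"], toy_dist)
print(tree)
```

    The closest pair is joined first, so in a progressive alignment those two sequences
    would be aligned before the more distant one is added.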


    Output

    Format for generated multiple sequence alignment.

    Format           Description                                                  Abbreviation

    Aln w/numbers    ClustalW alignment format with base/residue numbering        aln1

    Aln wo/numbers   ClustalW alignment format without base/residue numbering     aln2

    GCG MSF          GCG Multiple Sequence File (MSF) alignment format            Gcg

    PHYLIP           PHYLIP interleaved alignment format                          Phylip

    NEXUS            NEXUS alignment format                                       Nexus

    NBRF/PIR         NBRF or PIR sequence format                                  Pir

    GDE              GDE sequence format                                          Gde

    Pearson/FASTA    Pearson or FASTA sequence format                             Fasta

    Default value is: Aln w/numbers [aln1]

    Order

    The order in which the sequences appear in the final alignment

    Order      Description                          Abbreviation

    aligned    Determined by the guide tree         aligned

    input      Same order as the input sequences    input

    Default value is: aligned


    Step 4 - Submission

    Job title

    It's possible to identify the tool result by giving it a name. This name will be associated with the

    results and might appear in some of the graphical representations of the results.

    Email Notification

    Running a tool is usually an interactive process: the results are delivered directly to the browser

    when they become available. Depending on the tool and its input parameters, this may take quite

    a long time. It's possible to be notified by email when the job is finished by simply ticking the

    box "Be notified by email". An email with a link to the results will be sent to the email address

    specified in the corresponding text box. Email notifications require valid email addresses.

    Email Address

    If email notification is requested, then a valid Internet email address must be provided. This is

    not required when running the tool interactively (the results will be delivered to the browser

    window when they are ready).


    CLUSTAL 2.1 multiple sequence alignment

    gi|166362739|ref|NM_001992.3|

    AGAGACTCTCACTGCACGCCGGAGGGCGCCCTTCCTCGCTCGCGCCCGCG 50gi|133892391|ref|NM_010169.3|

    --------------------------------------------------

    gi|166362739|ref|NM_001992.3|

    CGACCGCGCGCCCCAGTCCCGCCCCGCCCCGCTAACCGCCCCAGACACAG 100gi|133892391|ref|NM_010169.3| ------------------------------GCTA-----

    CTCAGAAA--- 12**** *

    **** *

    gi|166362739|ref|NM_001992.3|

    CGCTCGCCGAGGGTCGCTTGGACCCTGATCTTACCCGTGGGCACCCTGCG 150gi|133892391|ref|NM_010169.3| --------GAAG------TAGGC---GA------CGGCGGGCGCC----- 34

    ** * * * * ** * * ****

    **

    gi|166362739|ref|NM_001992.3|

    CTCTGCCTGCCGCGAAGACCGGCTCCCCGACCCGCAGAAGTCAGGAGAGA 200gi|133892391|ref|NM_010169.3| ---------------GGGCCG-----------

    CGC--------------- 43* *** ***

    gi|166362739|ref|NM_001992.3|

    GGGTGAAGCGGAGCAGCCCGAGGCGGGGCAGCCTCCCGGAGCAGCGCCGC 250

    gi|133892391|ref|NM_010169.3|-------------------------GGGCAGCCTT--------------- 53

    *********

    gi|166362739|ref|NM_001992.3|

    GCAGAGCCCGGGACAATGGGGCCGCGGCGGCTGCTGCTGGTGGCCGCCTG 300gi|133892391|ref|NM_010169.3|

    ---------GGGACAATGGGGCCCCGGCGCTTGCTGATCGTCGCCCTCGG 94************** ***** ***** * **

    *** * *

    gi|166362739|ref|NM_001992.3|

    CTTCAGTCTGTGCGGCCCGCTGTTGTCTGCCCGCACCCGGGCCCGCAGGC 350

    gi|133892391|ref|NM_010169.3|CCTCAGCCTGTGCGGTCCCTTGCTGTCTTCCCGCGTCCCTATGAGCCAGC 144

    * **** ******** ** ** ***** ***** **** **

    gi|166362739|ref|NM_001992.3|

    CAGAATCAAAAGCAACAAATGCCACCTTAGATCCCCGGTCATTTCTTCTC 400gi|133892391|ref|NM_010169.3|

    CAGAATCAGAGAGGACAGATGCTACGGTGAACCCCCGCTCATTCTTTCTA 194


    ******** * *** **** ** * * *****

    ***** ****

    gi|166362739|ref|NM_001992.3| AGGAACCCCAATGATAA---

    ATATGAACCATTTT------------GGGA 435gi|133892391|ref|NM_010169.3|

    AGGAATCCCAGTGAAAATACATTTGAACTGGTCCCCCTGGGGGATGAGGA 244***** **** *** ** ** ***** *

    ***

    gi|166362739|ref|NM_001992.3|GGATGAGGAGAAAAATGAAAGTGGGTTAACTGAATACAGATTAGTCTCCA 485

    gi|133892391|ref|NM_010169.3|GGAGGAGGAGAAAAATGAAAGCGTCCTGCTGGAGGGTAGGGCAGTCTACT 294

    *** ***************** * * ** **

    ***** *

    gi|166362739|ref|NM_001992.3|

    TCAATAAAAGCAGTCCTCTTCAAAAACAACTTCCTGCATTCATCTCAGAA 535gi|133892391|ref|NM_010169.3|

    TAAATATAAGCCTCCCTCCTCACACGCCGCCTCCTCCCTTCATCTCCGAG 344* **** **** **** *** * * * **** *

    ******** **

    gi|166362739|ref|NM_001992.3|GATGCCTCCGGATATTTGACCAGCTCCTGGCTGACACTCTTTGTCCCATC 585

    gi|133892391|ref|NM_010169.3|GACGCCTCCGGATATCTGACCAGCCCCTGGCTGACGCTCTTCATGCCCTC 394

    ** ************ ******** ********** *****

    * ** **

    gi|166362739|ref|NM_001992.3|

    TGTGTACACCGGAGTGTTTGTAGTCAGCCTCCCACTAAACATCATGGCCA 635gi|133892391|ref|NM_010169.3|

    CGTGTACACGATTGTGTTCATTGTCAGCCTTCCTCTGAACGTCCTGGCCA 444******** ***** * ******** ** ** ***

    ** ******

    gi|166362739|ref|NM_001992.3|TCGTTGTGTTCATCCTGAAAATGAAGGTCAAGAAGCCGGCGGTGGTGTAC 685

    gi|133892391|ref|NM_010169.3|TCGCAGTGTTCGTCTTGAGGATGAAGGTCAAGAAGCCGGCCGTGGTGTAC 494

    *** ****** ** *** *****************************

    gi|166362739|ref|NM_001992.3|

    ATGCTGCACCTGGCCACGGCAGATGTGCTGTTTGTGTCTGTGCTCCCCTT 735

    gi|133892391|ref|NM_010169.3|ATGCTGCACCTGGCCATGGCCGACGTGCTCTTCGTGTCGGTGCTCCCCTT 544

    **************** *** ** ***** ** *****

    ***********

    gi|166362739|ref|NM_001992.3|TAAGATCAGCTATTACTTTTCCGGCAGTGATTGGCAGTTTGGGTCTGAAT 785

    gi|133892391|ref|NM_010169.3|CAAGATCAGCTACTACTTCTCCGGCACTGATTGGCAGTTCGGGTCTGGAA 594


    *********** ***** ******* ************

    ******* *

    gi|166362739|ref|NM_001992.3|

    TGTGTCGCTTCGTCACTGCAGCATTTTACTGTAACATGTACGCCTCTATC 835gi|133892391|ref|NM_010169.3|

    TGTGCCGCTTCGCCACCGCAGCGTTTTACGGGAACATGTACGCCTCCATC 644**** ******* *** ***** ****** *

    ************** ***

    gi|166362739|ref|NM_001992.3|TTGCTCATGACAGTCATAAGCATTGACCGGTTTCTGGCTGTGGTGTATCC 885

    gi|133892391|ref|NM_010169.3|ATGCTCATGACGGTCATAAGCATTGACCGGTTCCTGGCGGTGGTGTATCC 694

    ********** ******************** *****

    ***********

    gi|166362739|ref|NM_001992.3|

    CATGCAGTCCCTCTCCTGGCGTACTCTGGGAAGGGCTTCCTTCACTTGTC 935gi|133892391|ref|NM_010169.3|

    GATCCAGTCCCTGTCCTGGCGCACTCTGGGCCGAGCCAACTTCACTTGCG 744** ******** ******** ******** * **

    *********

    gi|166362739|ref|NM_001992.3|TGGCCATCTGGGCTTTGGCCATCGCAGGGGTAGTGCCTCTGCTCCTCAAG 985

    gi|133892391|ref|NM_010169.3|TGGTCATTTGGGTGATGGCCATCATGGGGGTGGTGCCCCTTCTCCTCAAG 794

    *** *** **** ******** ***** ***** **

    *********

    gi|166362739|ref|NM_001992.3|

    GAGCAAACCATCCAGGTGCCCGGGCTCAACATCACTACCTGTCATGATGT 1035gi|133892391|ref|NM_010169.3|

    GAGCAGACCACCCGAGTTCCGGGACTCAACATCACCACCTGCCACGACGT 844***** **** ** ** ** ** *********** *****

    ** ** **

    gi|166362739|ref|NM_001992.3|GCTCAATGAAACCCTGCTCGAAGGCTACTATGCCTACTACTTCTCAGCCT 1085

    gi|133892391|ref|NM_010169.3|CCTCAGTGAGAACCTGATGCAAGGCTTTTACTCGTACTACTTCTCGGCCT 894

    **** *** * **** * ****** ** ************ ****

    gi|166362739|ref|NM_001992.3|

    TCTCTGCTGTCTTCTTTTTTGTGCCGCTGATCATTTCCACGGTCTGTTAT 1135

    gi|133892391|ref|NM_010169.3|TCTCCGCCATCTTCTTTCTTGTGCCGTTGATCGTTTCCACGGTCTGCTAC 944

    **** ** ******** ******** *****

    ************* **

    gi|166362739|ref|NM_001992.3|GTGTCTATCATTCGATGTCTTAGCTCTTCCGCAGTTGCCAACCGCAGCAA 1185

    gi|133892391|ref|NM_010169.3|ACGTCCATCATCCGGTGCCTGAGCTCCTCCGCGGTTGCCAACCGGAGCAA 994


    *** ***** ** ** ** ***** *****

    *********** *****

    gi|166362739|ref|NM_001992.3|

    GAAGTCCCGGGCTTTGTTCCTGTCAGCTGCTGTTTTCTGCATCTTCATCA 1235gi|133892391|ref|NM_010169.3|

    GAAGTCGCGGGCTTTGTTCCTGTCTGCCGCGGTGTTCTGCATCTTCATCG 1044****** ***************** ** ** **

    ***************

    gi|166362739|ref|NM_001992.3|TTTGCTTCGGACCCACAAACGTCCTCCTGATTGCGCATTACTCATTCCTT 1285

    gi|133892391|ref|NM_010169.3|TCTGCTTTGGGCCCACCAACGTCCTCCTGATTGTGCACTACCTTTTCCTC 1094

    * ***** ** ***** **************** *** ***

    *****

    gi|166362739|ref|NM_001992.3|

    TCTCACACTTCCACCACAGAGGCTGCCTACTTTGCCTACCTCCTCTGTGT 1335gi|133892391|ref|NM_010169.3|

    TCCGACAGTCCTGGTACAGAGGCAGCCTACTTTGCTTACCTCCTCTGCGT 1144** *** * * ******** ***********

    *********** **

    gi|166362739|ref|NM_001992.3|CTGTGTCAGCAGCATAAGCTGCTGCATCGACCCCCTAATTTACTATTACG 1385

    gi|133892391|ref|NM_010169.3|CTGTGTGAGCAGCGTGAGCTGCTGCATCGATCCGTTGATTTACTACTACG 1194

    ****** ****** * ************** ** *

    ******** ****

    gi|166362739|ref|NM_001992.3|

    CTTCCTCTGAGTGCCAGAGGTACGTCTACAGTATCTTATGCTGCAAAGAA 1435gi|133892391|ref|NM_010169.3|

    CCTCCTCCGAGTGCCAGAGGCACCTCTACAGCATCTTGTGCTGCAAAGAA 1244* ***** ************ ** ******* *****

    ************

    gi|166362739|ref|NM_001992.3|AGTTCCGATCCCAGCAGTTATAACAGCAGTGGGCAGTTGATGGCAAGTAA 1485

    gi|133892391|ref|NM_010169.3|AGCTCTGATCCCAACAGTTGCAACAGCACCGGCCAGCTGATGCCGAGTAA 1294

    ** ** ******* ***** ******* ** *** ****** *****

    gi|166362739|ref|NM_001992.3|

    AATGGATACCTGCTCTAGTAACCTGAATAACAGCATATACAAAAAGCTGT 1535

    gi|133892391|ref|NM_010169.3|AATGGATACCTGCTCTAGTCACCTGAATAACAGCATATACAAAAAGCTAT 1344

    *******************

    **************************** *

    gi|166362739|ref|NM_001992.3| TAACTTAGGAAAAGGGACTGCTGGGAGGTTAAA-AAGAAAAGTTTATAAA 1584

    gi|133892391|ref|NM_010169.3| TAGCTTAGGGAAAGGG-TTGCTGGAAGGTTCCATGAGAAAAGGTTG-GAA 1392


    ** ****** ****** ****** ***** * *******

    ** **

    gi|166362739|ref|NM_001992.3| AGTGAATAACCTGAGGATTCTATTAGTCCCCACCCA-

    AACTTTATTGA-T 1632gi|133892391|ref|NM_010169.3| AGCCAACAGCG-

    GGGAATCCCATTAGTCCCTGCAAAGAACTGTATTTACT 1441** ** * * * * ** * ********* * * ****

    **** * *

    gi|166362739|ref|NM_001992.3| TCACCTCCTAAAA--CAACAGATGTACGACTTGCATACCTGCTTTTTATG 1680

    gi|133892391|ref|NM_010169.3|TCGAAACCTAAAAAACAACCAATATCCGATATGCACGAATACTTCT---- 1487

    ** ******* **** ** * *** **** *

    *** *

    gi|166362739|ref|NM_001992.3|

    GGAGCTGTCAAGCATGTATTTTTGTCAATTACCAGAAAGATAACAGGAC- 1729gi|133892391|ref|NM_010169.3|

    ---GCTATCAAGAGTCTAGATTGGATAATTACCAGCAAGGTGACGGGAAC 1534*** ***** * ** ** * ********* *** *

    ** ***

    gi|166362739|ref|NM_001992.3|-GAGATGACGGTGTTATTCCAAGGGAATATTGCCAATGCTACAGTAATAA 1778

    gi|133892391|ref|NM_010169.3| GGAAATAAAGGTGT----CCAG-----TGTTGCTAGTGCTATGATAGTAA 1575

    ** ** * ***** *** * **** * *****

    ** ***

    gi|166362739|ref|NM_001992.3|

    ATGAATGTCACTTCTGGATATAGCTAGGTGACATATACATACTTACATGT 1828gi|133892391|ref|NM_010169.3| CTGGATGTCACTTCTT-ATATATCTAGGTGAC---------

    TTTA----- 1610** *********** ***** *********

    ***

    gi|166362739|ref|NM_001992.3| GTGTATATGTAGATG-TATGCACACACATATATTATTTGCAGTGCAGTAT 1877

    gi|133892391|ref|NM_010169.3| ----ATATATAGATGGTATGCACACAC-----TCATTTGTCATGCAGGAG 1651

    **** ****** *********** * ********** *

    gi|166362739|ref|NM_001992.3|

    AGAATAGGCACTTTAAAACACTCTTTCCCCGCACCCCAGCAATTA---TG 1924

    gi|133892391|ref|NM_010169.3| GGAATCTGCACTTTGACACA-TTTTTGTTTATTCCCTGGCCGTTACTATG 1700

    **** ******* * *** * *** *** **

    *** **

    gi|166362739|ref|NM_001992.3|AAAATAATCTCTGATTCCCTGATTTAATATGCAAAGTCTAGGTTGGTAGA 1974

    gi|133892391|ref|NM_010169.3| GAAATAATCT--GATTCTCTGACTTAATAAACAAAGTCTGAGTTGGTGGG 1748


    ********* ***** **** ****** ********

    ****** *

    gi|166362739|ref|NM_001992.3|

    GTTTAGCCCTGAACATTTCATGGTGTTCATCAACAGTGAGAGACTCCATA 2024gi|133892391|ref|NM_010169.3| TGTTAGCACTGGGCAGCTGGAGATCCTAAT-

    GATAGGGGAGGAGTCCGTA 1797***** *** ** * * * * ** * ** *

    ** *** **

    gi|166362739|ref|NM_001992.3| GTTTGGGCTTG-TACCACTTTTGCAAATAAGTGTATTTTGAAATTGTTTG 2073

    gi|133892391|ref|NM_010169.3| GTTTAGACTTAACACAGCTTTTGCCTATA--TTTTTTTTCAAATTATTTG 1845

    **** * *** ** ******* *** * * ****

    ***** ****

    gi|166362739|ref|NM_001992.3|

    ACGGCAAGGTTTAAGTTATTAAGAGGTAAGACTTAGTACTATCTGTGC-G 2122gi|133892391|ref|NM_010169.3| ATAATAATGGTTA-GTGATGGAAGGATGAGAC--

    AGTATTACCTGTGTAG 1892* ** * *** ** ** * * * **** **** **

    ***** *

    gi|166362739|ref|NM_001992.3|TAGAAGTTCTAGTGTTTTCAATTTTAAACATATCCAAGTTTGAATTCCTA 2172

    gi|133892391|ref|NM_010169.3|GGGAAGCTCTAATACTTTTCATCTTGAACATACCGTAGTTTTAA------ 1936

    **** **** * *** ** ** ****** * *****

    **

    gi|166362739|ref|NM_001992.3|

    AAATTATGGAAACAGATGAAAAGCCTCTGTTTTGATATGGGTAGTATTTT 2222gi|133892391|ref|NM_010169.3| GAATTATCAAGGCTGTTGGAAAACCC--

    GTTTTGATATGGGTAGCATTTT 1984****** * * * ** *** **

    **************** *****

    gi|166362739|ref|NM_001992.3| TT---------ACATTTTACACACTGTACACATAAGCCAAAACTGAGCAT 2263

    gi|133892391|ref|NM_010169.3|TTTTTTAACTTGCAATTTACTTACTGAATACATGGACCAAGACTGAGCAT 2034

    ** ** ***** **** * **** *************

    gi|166362739|ref|NM_001992.3| AAGTCCT-

    CTAGTGAATGTAGGCTGGCTTTCAGAGTAGGCTATTCCTGAG 2312

    gi|133892391|ref|NM_010169.3| AAGACTCACCAG-GACTGTAATAAACCTTACAAAGCAG-CCAAGCCT--- 2079

    *** * * ** ** **** *** ** ** ** * *

    ***

    gi|166362739|ref|NM_001992.3|AGCTGCATGTGTCCGCCCCCGATGGAGGACTCCAGGCAGCAGACACATGC 2362

    gi|133892391|ref|NM_010169.3| AGACACAGCCATCTGC-----ATGGAGGCCTCTGAGCACCAGGTACAT-- 2122


    ** ** ** ** ******* *** *** ***

    ****

    gi|166362739|ref|NM_001992.3|

    CAGGGCCATGTCAGACACAGATTGGCCAGAAACCTTCCTGCTGAGCCTCA 2412gi|133892391|ref|NM_010169.3| CACACCCCT------------TCGGCTATG---

    CCTCCCAGAGAGC---- 2153** ** * * *** * * ***

    ****

    gi|166362739|ref|NM_001992.3|CAGCAGTGAGACTGGGGCCACTACATTTGCTCCATCCTCCTGGGATT--- 2459

    gi|133892391|ref|NM_010169.3| -AGAGATG-GATGGGAAGCACCAGGCCCACCCCATCCTGCTAGGATTCTC 2201

    ** ** ** ** *** * * ******* **

    *****

    gi|166362739|ref|NM_001992.3|

    ---GGCTGTGAACTGATCATGTTTATGAGAAACTGGCAAAGCAGAATGTG 2506gi|133892391|ref|NM_010169.3|

    ATTAGCTGTGAGCTGACTGTGTCTTTTAGAAATTGGCAAGGTAAGGTATG 2251******* **** *** * * ***** ****** *

    * * **

    gi|166362739|ref|NM_001992.3|ATATCCTAGGAGGTAATGACCATGAAAGACTTCTCTACCCATCTTAAAAA 2556

    gi|133892391|ref|NM_010169.3|CCATCTTGGGAGGCAGTAACTATGAAAGACT------------------- 2282

    *** * ***** * * ** **********

    gi|166362739|ref|NM_001992.3|CAACGAAAGAAGGCATGGACTTCTGGATGCCCATCCACTGGGTGTAAACA 2606

    gi|133892391|ref|NM_010169.3| -GACGAGAGGAGAAA-------------------------GGTGTGTTTA 2306

    **** ** ** ****** *

    gi|166362739|ref|NM_001992.3|

    CATCTAGTAGTTGTTCTGAAATGTCAGTTCTGATATGGAAGCACCCATT- 2655gi|133892391|ref|NM_010169.3|

    CATCCAGTAGCTGTCCTGCAAGGCTGGCCCTTGCACAGACAGACACACCC 2356**** ***** *** *** ** * * ** * **

    ** **

    gi|166362739|ref|NM_001992.3| ATGCGCTGTGGCCACTCCAATAGGTGCTGAG---TGTACAGAGT---GGA 2699

    gi|133892391|ref|NM_010169.3|

    ACATGCCCTGGTCACACTGTTGGATAGTGGGCCATAGACTGACTATAGGA 2406* ** *** *** * * * * ** * * ** **

    * ***

    gi|166362739|ref|NM_001992.3| ATAAGACAGAGACCTGCCCTCAA--

    GAGCAAAGTAGA------------- 2734gi|133892391|ref|NM_010169.3|

    GAATAACCGAGTCCTGTCCTTACTCAGGCAACGCAGAGAGCTGGCATGTG 2456* ** *** **** *** * **** * ***


    gi|166362739|ref|NM_001992.3| --------TCATGCATAGAG----TGT-----

    GATGTATGTGTAATAAAT 2767

    gi|133892391|ref|NM_010169.3|GTCAGCTATGATGCACATAGAACTTGTCTTCAGCTGGATGTG-ACCAAGT 2505

    * ***** * ** *** * ** ****** ** *

    gi|166362739|ref|NM_001992.3|

    ATGTTTCACACAAACAAGGCCTGTCAGCTAAAGAAGTTTGAACATTTGGG 2817

    gi|133892391|ref|NM_010169.3|

    GTATTTCACATAAGCAAGGCCTATCAGCTAAACTGCTTTGCATATCTGAG 2555* ******* ** ******** ********* **** *

    ** ** *

    gi|166362739|ref|NM_001992.3|

    TTACTATTTCTTGTGGTTATAACTTAATGAAAACAATGCAGTACAGGACA 2867

    gi|133892391|ref|NM_010169.3| TTTCTGCTTCCAGTAGCTATAGATTAG-GATAAAAACACAGTATAAGATG 2604

    ** ** *** ** * **** *** ** ** ******* * **

    gi|166362739|ref|NM_001992.3| TATATTTTTTAAA-ATAAGTCT---GATTTA----

    ATTGGGCACTATTTA 2909

    gi|133892391|ref|NM_010169.3|

    TATATTTTTAATACATATGCCCTTCAGCCTACAAAATTACACACTATTTA 2654********* * * *** * * ** ***

    *********

    gi|166362739|ref|NM_001992.3|

    TTTACAAATGTTTTGCTCAATAGATTGCTCAAATCAGGTTTTCTTTTAAG 2959

    gi|133892391|ref|NM_010169.3| TTTACAAATGTTTT-TTCAA-AAATTACTCAAATCAG--------CCAGG 2694

    ************** **** * *** *********** *

    gi|166362739|ref|NM_001992.3|

    AATCAATCATGTCAGTCTGCTTAGAAATAACAGAAGAAAATAGAATTGAC 3009

    gi|133892391|ref|NM_010169.3| CAT----TATGGTATACACCTT-----

    TAATCCCAGAACTTGGGA--GGC 2733** *** * * *** *** **** *

    * * * *

    gi|166362739|ref|NM_001992.3|ATTGAAATCTAGGAAAATTATTCTATAATTTCCATTTACTTAAGACTTAA 3059

    gi|133892391|ref|NM_010169.3| A--GAGG--CAGGCAGATC-TTAAACAATTT---TTTTTTTAAGAAACAA 2775

    * ** *** * ** ** * ***** ***

    ****** **

    gi|166362739|ref|NM_001992.3|

    TGAGACTTTAAAAGCATTTTTTAACCTCCTAAGTATCAAGTATAGAAAAT 3109

    gi|133892391|ref|NM_010169.3| GCAAACACAAAAAG----TTTTA----CTTAAGT-

    CCAA----------- 2805* ** ***** ***** * ***** ***

    gi|166362739|ref|NM_001992.3|

    CTTCATGGAATTCACAAAGTAATTTGGAAATTAGGTTGAAACATATCTCT 3159


    gi|133892391|ref|NM_010169.3| TTTTAAGAAATATATAGGTCAGTTTGG---

    TTA----------------- 2835

    ** * * *** * * * ***** ***

    gi|166362739|ref|NM_001992.3|TATCTTACGAAAAAATGGTAGCATTTTAAACAAAATAGAAAGTTGCAAGG 3209

    gi|133892391|ref|NM_010169.3| -----------AAAATAATAGTA------ATGAA--AGGAAATTTCA--- 2863

    ***** *** * * ** ** **

    ** **

    gi|166362739|ref|NM_001992.3|

    CAAATGTTTATTTAAAAGAGCAGGCCAGGCGCGGTGGCTCACGCCTGTAA 3259gi|133892391|ref|NM_010169.3|

    -------TTGATTGAAA----------------------------TTTAT 2878

    ** ** ***

    * **

    gi|166362739|ref|NM_001992.3|TCCCAGCACTTTGGGAGGCTGAGGCGGGTGGATCACGAGGTCAGGAGATC 3309

    gi|133892391|ref|NM_010169.3| TCT--GTATTTT--------------------TCTTGAGTT------ATT 2900

    ** * * *** ** *** *

    **

    gi|166362739|ref|NM_001992.3|

    GAGACCATCCTGGCTAACACGGTGAAACCCGTCTCTACTAAAAATGCAAA 3359gi|133892391|ref|NM_010169.3| GAGATTATTT-----------GTAAAGC--ATTTTT------

    AATGCCAC 2931

    **** ** ** ** * * * *

    ***** *

    gi|166362739|ref|NM_001992.3|AAAAATTAGCCGGGCGTGGTGGCAGGCACCTGTAGTCCCAGCTACTCGGG 3409

    gi|133892391|ref|NM_010169.3| AGTGACTA-------------ACAAGCATATAAAATCTTCA-TAC----- 2962

    * * ** ** *** * * **

    ***

    gi|166362739|ref|NM_001992.3|

    AGGCTGAGGCAGGAGACTGGCGTGAACCCAGGAGGCGGACCTTGTAGTGA 3459gi|133892391|ref|NM_010169.3| ---CTTTGACAAAA---

    TAATTTGAA-------------------AATTA 2987** * ** * * ****

    * * *

    gi|166362739|ref|NM_001992.3|

    GCCGAGATCGCGCCACTGTGCTCCAGCCTGGGCAACAGAGCAAGACTCCA 3509gi|133892391|ref|NM_010169.3| ATTTAAAACATATCCTTTTTCT--------GATGAAAAAATATGTTGGCA 3029

    * * * * * * ** * * * * *

    * **

    gi|166362739|ref|NM_001992.3| TCTCAAA-

    AAATAAAAATAAATAAAAAATAAAAAAATAAAAGAGCAAACT 3558gi|133892391|ref|NM_010169.3| TTTTAAGCAAATAAGAGTAGA--

    AAGGTTGTTTATTTAAGAGAACAAAGT 3077


    * * ** ****** * ** * ** * * ***

    *** **** *

    gi|166362739|ref|NM_001992.3|

    ATTTCCAAATACCATAGAATAACTTACATAAAAGTAATATAACTGTATTG 3608gi|133892391|ref|NM_010169.3|

    ATTTCCAAATACTGTAGAGTCGCTTCCACGAAAGTCCTATGGTTGTATGG 3127************ **** * *** ** ***** ***

    ***** *

    gi|166362739|ref|NM_001992.3|TAAGTAGAAGCTAGCACTGGTTTTATTAATTTAGTGACTATTCATTTTAT 3658

    gi|133892391|ref|NM_010169.3| TTAAC-----TTGGTTCCGGTGTT-----------GGCTG--------AT 3153

    * * * * * *** ** * **

    **

    gi|166362739|ref|NM_001992.3|

    CTAAATCAGTGAAGATTTACTGTCATTGTTTATTAGTCTGTATATATTAA 3708gi|133892391|ref|NM_010169.3| CTCAATTACTGA---CTCCCTGTC-CCGTGT-----

    TCTGTCTGTGACTT 3194** *** * *** * ***** ** * *****

    * *

    gi|166362739|ref|NM_001992.3| AATATGA-TATCATTAATGTACTTACAAAATAGTATGTCACTGTTTTTAT 3757

    gi|133892391|ref|NM_010169.3|AATGTAACTGTTATCACCGCGCTTGTGACCTTTTACGTCATTGTTTT-GT 3243

    *** * * * * ** * * *** * * ** ****

    ****** *

    gi|166362739|ref|NM_001992.3| GTTCA-----

    TTCTTAAAAACATAACCTGTATTAATAAATGTGAACATTT 3802gi|133892391|ref|NM_010169.3| GTTCACCCTCTTTTTTAAAAAAAAA--TATATTAATAAAC-

    TAAAACCAT 3290***** ** ** **** * ** * ********** *

    ** *

    gi|166362739|ref|NM_001992.3|GCTTGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 3847

    gi|133892391|ref|NM_010169.3|GCTTGG--------------------------------------- 3296

    ******


    RESULTS:

    Result files

    Input Sequences

    clustalw2-I20110331-115850-0033-1317260-oy.input

    Tool Output

    clustalw2-I20110331-115850-0033-1317260-oy.output

    Alignments in CLUSTALW format

    clustalw2-I20110331-115850-0033-1317260-oy.clustalw

    Guide Tree

    clustalw2-I20110331-115850-0033-1317260-oy.dnd

    Scores Table

    SeqA  Name                           Length  SeqB  Name                           Length  Score

    1     gi|166362739|ref|NM_001992.3|  3847    2     gi|133892391|ref|NM_010169.3|  3296    70.0
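    A pairwise score of this kind can be approximated from two aligned rows. The sketch
    below assumes the score is the percentage of identical residues among the columns where
    neither sequence has a gap; that definition, the function name and the toy rows are
    assumptions, and the tool's own calculation may differ in detail.

```python
def percent_identity(row_a, row_b):
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue              # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned rows, not taken from the alignment above
score = percent_identity("ACGT-ACGTA", "ACCTTAC-TA")
print(score)
```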


    EXPERIMENT NO.7

    AIM: To perform pairwise sequence alignment for two retrieved sequences using BLAST

    In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or

    protein to identify regions of similarity that may be a consequence of functional, structural,

    or evolutionary relationships between the sequences. Aligned sequences of nucleotide or

    amino acid residues are typically represented as rows within a matrix. Gaps are inserted

    between the residues so that identical or similar characters are aligned in successive

    columns. Very short or very similar sequences can be aligned by hand. However, most

    interesting problems require the alignment of lengthy, highly variable or extremely

    numerous sequences that cannot be aligned solely by human effort. Instead, human

    knowledge is applied in constructing algorithms to produce high-quality sequence

    alignments, and occasionally in adjusting the final results to reflect patterns that are

    difficult to represent algorithmically (especially in the case of nucleotide sequences).

    Computational approaches to sequence alignment generally fall into two categories: global

    alignments and local alignments. Calculating a global alignment is a form of global

    optimization that "forces" the alignment to span the entire length of all query sequences.

    By contrast, local alignments identify regions of similarity within long sequences that are

    often widely divergent overall. In bioinformatics, local alignment is mainly performed

    using the Basic Local Alignment Search Tool (BLAST). A BLAST search enables a

    researcher to compare a query sequence with a library or database of sequences, and

    identify library sequences that resemble the query sequence above a certain threshold.

    BLAST is one of the most widely used bioinformatics programs because it addresses a

    fundamental problem and the algorithm emphasizes speed over sensitivity. This emphasis

    on speed is vital to making the algorithm practical on the huge genome databases currently

    available, although subsequent algorithms can be even faster. Input sequences in BLAST

    are in FASTA format or GenBank format.
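    The difference between global and local alignment can be made concrete with a small
    dynamic-programming sketch. It uses the textbook Needleman-Wunsch (global) and
    Smith-Waterman (local) recurrences with a simple linear gap penalty; these are exact
    methods shown for illustration, not the heuristic BLAST uses, and the scoring values
    and toy sequences are assumptions.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two toy sequences sharing the region ACGTACGT but differing elsewhere
x, y = "GGGGACGTACGT", "ACGTACGTCCCC"
global_score = alignment_score(x, y)
local_score = alignment_score(x, y, local=True)
print(global_score, local_score)
```

    The local score reflects only the shared region, while the global score is pulled down
    by the flanking residues that must also be aligned.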


    BLAST output can be delivered in a variety of formats, including HTML, plain text,

    and XML. For NCBI's web page, the default output format is HTML. When

    performing a BLAST search on NCBI, the results are given in a graphical format showing the hits

    found, a table showing sequence identifiers for the hits with scoring-related data, and

    alignments of the sequence of interest with the hits received, with the corresponding BLAST

    scores. Using a heuristic method, BLAST finds homologous sequences not by

    comparing either sequence in its entirety, but rather by locating short matches between the

    two sequences. This process of finding initial words is called seeding. It is after this first

    match that BLAST begins to make local alignments. While attempting to find homology in

    sequences, sets of common letters, known as words, are very important. The heuristic

    algorithm of BLAST locates all common words between the sequence of interest and the

    hit sequence, or sequences, from the database. These results are then used to build an

    alignment. After making words for the sequence of interest, neighborhood words are also

    assembled. These words must have a score of at least the

    threshold, T, when compared using a scoring matrix. The threshold score T determines

    whether a particular word will be included in the alignment or not. Once seeding has been

    conducted, the alignment, which is only 3 residues long, is extended in both directions by

    the algorithm used by BLAST. Each extension changes the score of the alignment, either

    increasing or decreasing it. Should this score be higher than a pre-determined T, the

    alignment will be included in the results given by BLAST. However, should this score be

    lower than this pre-determined T, the alignment will cease to extend, preventing areas of

    poor alignment from being included in the BLAST results.
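    The seed-and-extend idea can be sketched as follows. This is a heavily simplified
    illustration, not the BLAST implementation: it uses exact 3-letter words only (no
    neighborhood words or scoring matrix), ungapped extension, and a hypothetical
    drop_off parameter that stops extension once the score falls that far below the best
    seen. All names and values are assumptions.

```python
def seed_and_extend(query, subject, word_size=3, match=1, mismatch=-1, drop_off=2):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    # Seeding: index every word of the query by its start position
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)

    def extend(qi, si, step):
        """Extend in one direction; return (best score gain, residues added)."""
        best_gain = gain = 0
        best_len = n = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            gain += match if query[qi] == subject[si] else mismatch
            n += 1
            if gain > best_gain:
                best_gain, best_len = gain, n
            elif best_gain - gain >= drop_off:
                break                      # score dropped too far: stop extending
            qi += step
            si += step
        return best_gain, best_len

    best = None
    for s in range(len(subject) - word_size + 1):
        for q in words.get(subject[s:s + word_size], []):
            right_gain, right_len = extend(q + word_size, s + word_size, +1)
            left_gain, left_len = extend(q - 1, s - 1, -1)
            score = word_size * match + left_gain + right_gain
            hit = (score, q - left_len, s - left_len, word_size + left_len + right_len)
            if best is None or hit[0] > best[0]:
                best = hit
    return best


# Toy sequences sharing the region ACGTACG
hit = seed_and_extend("TTACGTACGAA", "GGACGTACGCC")
print(hit)
```

    Each 3-letter seed is grown in both directions, and only the highest-scoring extended
    region is reported here; a real search would report every hit above its cut-off.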

    PROCEDURE:

    Search for BLAST on Google and open

    http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome

    Select the BLAST type you want to perform; for instance, select nucleotide BLAST.

    Submit the sequence to be searched, either in FASTA format or as an NCBI

    accession number.

    Select the database to be searched.

    Click on BLAST.
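    If the query sequence is only available as raw text, it can be put into FASTA format
    before being pasted into the search box. A minimal sketch; the function name, the
    identifier "query1" and the 60-character line width are assumptions chosen for
    illustration.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as a FASTA record: a '>' header line, then wrapped residues."""
    header = ">" + identifier + (" " + description if description else "")
    cleaned = "".join(sequence.split()).upper()   # remove spaces and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


# Hypothetical record; the width is shortened here only to show the wrapping
record = to_fasta("query1", "acgt acgt\nac", width=4)
print(record)
```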
