genomics file

8/7/2019 genomics file


GENOMICS FILE

SUBMITTED TO:- SUBMITTED BY:-

MS INDU GAUR K.PUNIT PUSHKAR

IMT/07/8037

SECTION S



EXPERIMENT NO.1

AIM: To study different websites and databases related to genomic research

NCBI-The National Center for Biotechnology Information (NCBI) is part of the United

States National Library of Medicine (NLM), a branch of the National Institutes of Health. The

NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored

by Senator Claude Pepper. The NCBI houses genome sequencing data in GenBank and an index

of biomedical research articles in PubMed Central and PubMed, as well as other information

relevant to biotechnology. All these databases are available online through the Entrez search

engine. The Entrez Global Query Cross-Database Search System is a powerful federated

search engine, or web portal that allows users to search many discrete health sciences databases

at the NCBI website. The NCBI has had responsibility for making available the

GenBank DNA sequence database since 1992. GenBank coordinates with individual laboratories

and other sequence databases such as those of the European Molecular Biology

Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Two major roles of NCBI are to

conduct research in the field of computational biology and to build public databases. GenBank is

part of the International Nucleotide Sequence Database Collaboration along with Europe's EMBL

and Japan's DDBJ.
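
Entrez can also be queried from a program through NCBI's E-utilities web service. The sketch below only builds an EFetch request URL for one nucleotide record in FASTA format; the helper name `efetch_url` and the example accession are illustrative assumptions, and nothing is actually downloaded.

```python
from urllib.parse import urlencode

# Base address of the EFetch utility in NCBI's E-utilities.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for one record (hypothetical helper, no network access)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


# Example accession chosen only for illustration.
print(efetch_url("NM_000518"))
```

Pasting the printed URL into a browser should return the record as plain text, provided the accession exists in the chosen database.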

1. GenBank- The GenBank sequence database is an open access, annotated collection of all

publicly available nucleotide sequences and their protein translations. This database is

produced and maintained by the National Center for Biotechnology Information (NCBI)

as part of the International Nucleotide Sequence Database Collaboration (INSDC). Direct

submissions are made to GenBank using BankIt, which is a Web-based form, or the

stand-alone submission program, Sequin. Upon receipt of a sequence submission, the

GenBank staff assigns an accession number to the sequence and performs quality

assurance checks. The submissions are then released to the public database, where the

entries are retrievable by Entrez or downloadable by FTP.



SEQUENCE SUBMISSION TOOLS include BankIt and Sequin. BankIt is used when we

have a single sequence, a simple set of sequences or a small batch of different sequences.

It is a web-based submission tool. Sequin is a stand-alone software tool developed by the

NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence

databases. It is capable of handling simple submissions that contain a single short mRNA

sequence, and complex submissions containing long sequences, multiple annotations,

segmented sets of DNA, or phylogenetic and population studies.

2. EMBL- The European Molecular Biology Laboratory (EMBL) is a molecular

biology research institution supported by 20 European countries, with Australia as an

associate member state. EMBL was created in 1974 and is an intergovernmental

organisation funded by public research money from its member states. The cornerstones

of EMBL's mission are manifold. Basic research in molecular biology and molecular

medicine is performed; scientists, students and visitors at all levels are trained; vital

services to scientists in the member states are offered; new instruments and methods in

the life sciences are developed; and there is an active engagement in technology

transfer. The European Bioinformatics Institute (EBI), an EMBL outstation, maintains the

EMBL nucleotide sequence database.

3. DDBJ- The DNA Data Bank of Japan (DDBJ) is a DNA data bank. It is located at

the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a

member of the International Nucleotide Sequence Database Collaboration. It exchanges

its data with European Molecular Biology Laboratory at the European Bioinformatics

Institute and with GenBank at the National Center for Biotechnology Information on a

daily basis. Thus the three databanks contain the same data at any given time.

TYPES OF DATABASES:

NUCLEOTIDE DATABASES

dbEST is a division of GenBank established in 1992. As for GenBank, data in dbEST is directly

submitted by laboratories worldwide and is not curated.

dbSTS is an NCBI resource that contains sequence data for short genomic landmark sequences

or Sequence Tagged Sites. STS sequences are incorporated into the STS Division

of GenBank. The dbSTS database offers a route for submission of STS sequences to GenBank. It

is designed especially for the submission of large batches of STS sequences.

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic

variation within and across different species developed and hosted by the National Center for

Biotechnology Information (NCBI) in collaboration with the National Human Genome Research

Institute (NHGRI). Although the name of the database implies a collection of one class of

polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of

molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs),

(3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms

(MNPs), (5) heterozygous sequences, and (6) named variants.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection

of publicly available nucleotide sequences (DNA, RNA) and their protein products. This

database is built by NCBI and, unlike GenBank, provides only a single record for each natural

biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to

bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate and linked

records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data is available.

PROTEIN DATABASES

The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological

molecules, such as proteins and nucleic acids. The data,

typically obtained by X-ray crystallography or NMR spectroscopy and submitted

by biologists and biochemists from around the world, are freely accessible on the Internet via the

websites of its member organisations (PDBe, PDBj, and RCSB). The PDB is overseen by an

organization called the Worldwide Protein Data Bank, wwPDB.

The Protein Clusters database provides easy access to annotation information, publications,

domains, structures, and external links and analysis tools including multiple alignments,

phylogenetic trees, and genomic neighborhoods (ProtMap). Protein Clusters can be searched like

any other Entrez database.

STRUCTURAL DATABASES

The Conserved Domain Database (CDD) brings together several collections of multiple

sequence alignments representing conserved domains, including NCBI-curated domains, which

use 3D-structure information to explicitly define domain boundaries and provide insights into

sequence/structure/function relationships, as well as domain models imported from a number

of external source databases. The data are then used for putative functional annotation of protein

query sequences based on matches to specific hits.

The Structural Classification of RNA (SCOR) database provides a survey of the three-

dimensional motifs contained in 259 NMR and X-ray RNA structures. In one classification, the

structures are grouped according to function. The RNA motifs, including internal and external

loops, are also organized in a hierarchical classification. The 259 database entries contain 223

internal and 203 external loops; 52 entries consist of fully complementary duplexes.

GENOME DATABASES

The NCBI Entrez Genome database is a collection of complete large-scale sequencing,

assembly, annotation, and mapping projects for cellular organisms. It contains genomic

sequences at different stages of finishing from both the public domain sequencing effort and

Celera Genomics, protein function data and gene structure. It helps in understanding the



genomic organization of genes, mapping a gene, understanding the exon/intron structure of a

gene, searching for genetic and physical markers, and accessing comprehensive information

about a gene, its transcript(s) and protein(s), structure, activity, and location.

CHEMICAL DATABASES

PubChem is a database of chemical molecules and their activities against biological assays.

The system is maintained by the NCBI, a component of the National Library of Medicine,

which is part of the United States National Institutes of Health (NIH). PubChem can be accessed for free through a web user interface.

Chemical Entities of Biological Interest, also known as ChEBI, is a database and ontology of

molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical

Ontologies effort. The term "molecular entity" refers to any "constitutionally or isotopically

distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc.,

identifiable as a separately distinguishable entity". ChEBI uses nomenclature, symbolism and

terminology endorsed by the International Union of Pure and Applied Chemistry (IUPAC) and

the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology.

METABOLIC OR PATHWAY DATABASES

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases

dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY

database records networks of molecular interactions in the cells, and variants of them specific

to particular organisms.



LITERATURE DATABASES

PubMed is a free database accessing primarily the MEDLINE database of references and

abstracts on life sciences and biomedical topics. In addition to MEDLINE, it also provides

access to OLDMEDLINE for pre-1966 records and citations to articles from MEDLINE

journals. Citations may include links to full-text content from PubMed Central and publisher

web sites. MEDLINE (Medical Literature Analysis and Retrieval System Online) is a

bibliographic database of life sciences and biomedical information. It includes bibliographic

information for articles from academic journals covering medicine, nursing, pharmacy,

dentistry, veterinary medicine, and health care. MEDLINE also covers much of the literature

in biology and biochemistry, as well as fields such as molecular evolution. Compiled by the

United States National Library of Medicine (NLM), MEDLINE is freely available on the

Internet and searchable via PubMed and NLM's National Center for Biotechnology

Information's Entrez system.

DISEASE DATABASES

OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic

phenotypes. The full-text, referenced overviews in OMIM contain information on all known

Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between

phenotype and genotype. It is updated daily, and the entries contain copious links to other

genetics resources.



EXPERIMENT NO.2

AIM: To study different tools used for genomic research

Phred: Better Base Calling

Phred is a base-calling program for DNA sequence traces. The program was developed by Drs.

Phil Green and Brent Ewing, and is copyrighted by the University of Washington. It is widely

used by the largest academic and commercial sequencing laboratories. Its base-calling

accuracy is high, with error rates reported to be 40-50% lower than those of earlier base callers. The accurate error probabilities Phred calculates

for each base enable increased automation of the sequencing process: for example, drastically

lower false positive error rates in mutation detection, effective quality control immediately after

sequence production, and quantitative benchmarking of different sequencing methods and protocol

changes. Phred was developed for the Human Genome Project, where large amounts of sequence

data were processed by automated scripts; therefore, Phred's processing options are set by

command line parameters. For Windows and OS X users who would like to use Phred through

an easy-to-use graphical user interface, CodonCode Corporation offers the sequence analysis software

CodonCode Aligner. CodonCode Aligner greatly simplifies using Phred for base calling and

Phrap for sequence assembly, and also offers a number of additional functions often needed in

DNA sequencing projects, for example contig alignment and editing, reference sequence

alignments, and mutation detection.
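
The error probability that Phred assigns to each base is usually reported as a quality score Q, defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The short sketch below converts between the two; the function names are chosen here for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


# Q20 corresponds to 1 error in 100 calls, Q30 to 1 error in 1000 calls.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```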

Phrap: Better Sequence Assemblies

Phrap is a leading program for DNA sequence assembly. Phrap is routinely used in some of the largest sequencing projects in the Human Genome Project and in the biotech

industry. Some of Phrap's features include:



Fast assemblies- Assemblies of cosmid- to BAC sized projects with several hundred to two

thousand reads typically take only minutes to complete on high-powered workstations or

personal computers.

Accurate consensus sequences- Phrap uses Phred's quality scores to determine highly accurate

consensus sequences. Phrap examines all individual sequences at a given position, and generally

uses the highest quality sequence to build the consensus. Compared to simple majority rules used in

older sequence assembly programs, Phrap's approach can give significantly more accurate

consensus sequences.

Consensus quality estimates- Phrap uses the quality information of individual sequences to

estimate the quality of the consensus sequence. In addition, Phrap uses available information

about sequencing chemistry (dye terminator or dye primer) and confirmation by "other strand"

reads in estimating the consensus quality.

Ability to assemble very large projects- Phrap has been used routinely to assemble bacterial

genomes sequenced by the "shotgun" approach, where each project contained tens of thousands

of reads. Smaller bacterial genomes (2 million bases or less) could often be assembled in less

than three hours.

Improved identification and handling of repeats- Phrap uses quality scores to estimate whether

discrepancies between two overlapping sequences are more likely to arise from random errors, or

from different copies of a repeated sequence. For repeats with 95 to 98% identity (like human

Alu sequences) and high quality sequence data, this typically yields correct assemblies.
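
The difference between a simple majority rule and a quality-based choice of the consensus base can be illustrated with one alignment column. This is a deliberately simplified toy, not Phrap's actual algorithm; the function names and the example qualities are assumptions made for illustration.

```python
from collections import Counter


def majority_base(column):
    """Pick the most frequent base in a column of (base, quality) pairs."""
    counts = Counter(base for base, _ in column)
    return counts.most_common(1)[0][0]


def best_quality_base(column):
    """Pick the base carried by the read with the highest quality score."""
    return max(column, key=lambda pair: pair[1])[0]


# Two low-quality reads say A, one high-quality read says G.
column = [("A", 12), ("A", 10), ("G", 45)]
print(majority_base(column))      # majority rule follows the two weak reads
print(best_quality_base(column))  # quality-based rule follows the strong read
```

With these toy values the majority rule reports A while the quality-based rule reports G, which is the situation where using Phred scores can improve the consensus.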

Cross_match: Fast DNA Sequence Comparisons and Vector Screening

Cross_match is a program for fast comparisons of DNA sequences that uses the same algorithms

as Phrap. For example, the comparison of several hundred thousand bases of "raw" sequence to

the sequence of an entire BAC typically takes less than one minute. Within the Phred - Phrap

system, Cross_match is typically used for vector screening. In addition to this, it is also used for



the identification of overlaps between contig ends after assembly with Phrap, identification of

potential repeat sequences in assemblies, generation of error summaries and lists after

completion of sequencing projects and estimation of vector contamination in newly created

libraries.

Fgenesh

It is a gene prediction program that falls under the ab initio gene prediction category. It is an HMM

based program that has parameters for finding genes in humans, Drosophila, plants, yeast and

nematodes. The program predicts some genes that are not annotated as genes and fails to

predict some genes that do exist. A newer program, Fgenesh+, recovers some of the

missed genes when information about homologous protein sequences is supplied. It

is better in terms of sensitivity and specificity suggesting that, while ignoring similarity to

cDNAs, ESTs, and protein sequences may be appropriate for analyzing the ab initio part of a

predictor algorithm, for true-life scenarios of predicting genes in newly sequenced eukaryotic

genomes, more genes can be predicted by inclusion of these database sequences.

Glimmer

Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria,

archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses

interpolated Markov models (IMMs) to identify the coding regions and distinguish them from

noncoding DNA. The IMM approach uses a combination of Markov models from 1st through

8th-order, weighting each model according to its predictive power. Glimmer uses 3-periodic

nonhomogenous Markov models in its IMMs.

Glimmer was the primary microbial gene finder used at The Institute for Genomic Research

(TIGR), where it was first developed, and has been used to annotate the complete genomes of

over 100 bacterial species from TIGR and other labs. Glimmer was used as the basis for the design of



GlimmerM, which includes an algorithm for predicting splice sites. Further improvements to

GlimmerM for the purpose of eukaryotic gene prediction resulted in the generation of

GlimmerHMM. GlimmerHMM also adds splice site predictors adapted from the GeneSplicer program.

Grail

The goal of the GRAIL program is to utilize several algorithms detecting different features of a

protein coding gene to predict with high accuracy the position of a gene within a genome.

Originally, GRAIL examined the presence of these several features (discussed below) in a

sliding 99-nucleotide window; however, this biases the program towards prediction of longer

exons and misses a larger number of shorter exons. This bias was later removed by allowing the

program to examine all possible exons, rather than just those in a sliding window. In both cases,

GRAIL utilizes a neural network to combine predictions for all these gene features. GRAIL starts

by scoring a region as protein coding versus protein noncoding based on frequency of 6-mers

that occur often in coding versus noncoding sequences. These coding regions are then scored for

the presence of a start codon, with a stop codon downstream and in-frame. A higher score is

achieved by the presence of these features. The GRAIL algorithm can also identify frameshift

mutations (insertions or deletions) that may be introduced due to errors during sequencing, by

determining when a shift in frame occurs in a region with high coding potential, creating an out-

of-frame stop codon. Splice sites are also detected, by analysis of the coding region surrounding

splice donor sequences and splice acceptor sequences. GRAIL also scores CpG islands, which

are underrepresented in the genome but enriched just 5' of coding regions, the presence of a

TATA box, and the polyadenylation signal.
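
One of the features described above, a start codon followed by an in-frame stop codon, can be sketched as a simple scan. This covers only that single feature on the forward strand and is not GRAIL's neural network; the function name and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    `end` is exclusive and includes the stop codon. Only ORFs with at least
    `min_codons` codons before the stop are reported.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

In the toy sequence the ATG at index 2 is followed in frame by AAA, TGG and the stop codon TAA, so one ORF is reported. The second ATG has no in-frame stop and is ignored.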



EXPERIMENT NO.3

AIM: To Study DNA sequencing methods

DNA sequencing is the process of determining the order of the nucleotide

bases (adenine, guanine, cytosine, and thymine) in a molecule of DNA. Knowledge of DNA

sequences has become indispensable for basic biological research, other research branches

utilizing DNA sequencing, and in numerous applied fields such as diagnostics, biotechnology,

forensic biology and biological systematics. The advent of DNA sequencing has significantly

accelerated biological research and discovery. The rapid speed of sequencing attained with

modern DNA sequencing technology has been instrumental in the sequencing of the human

genome, in the Human Genome Project.

Maxam-Gilbert sequencing

In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on

chemical modification of DNA and subsequent cleavage at specific sites. The method requires

radioactive labeling at one 5' end of the DNA (typically by a kinase reaction using gamma-32P

ATP) and purification of the DNA fragment to be sequenced. Chemical treatment generates

breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions

(G, A+G, C, C+T). For example, the purines (A+G) are depurinated using formic acid, the

guanines (and to some extent the adenines) are methylated by dimethyl sulfate, and the

pyrimidines (C+T) are hydrolysed using hydrazine. The addition of salt (sodium chloride) to the

hydrazine reaction inhibits the reaction of thymine for the C-only reaction. The modified

DNAs are then cleaved by hot piperidine at the position of the modified base. The concentration

of the modifying chemicals is controlled to introduce on average one modification per DNA

molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first

"cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side



DNA bands are then visualized by autoradiography or UV light, and the DNA sequence can be

directly read off the X-ray film or gel image. In the image on the right, X-ray film was exposed

to the gel, and the dark bands correspond to DNA fragments of different lengths. A dark band in

a lane indicates a DNA fragment that is the result of chain termination after incorporation of a

dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP). The relative positions of the different

bands among the four lanes are then used to read (from bottom to top) the DNA sequence.
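
Reading the gel from bottom to top amounts to sorting the bands of all four lanes by fragment length, because the shortest fragments migrate furthest. The sketch below uses made-up fragment lengths; the function name and the data layout are assumptions for illustration. The sequence read this way is that of the newly synthesized strand, in the 5' to 3' direction.

```python
def read_gel(lanes):
    """Read a sequence from four dideoxy lanes.

    `lanes` maps each base (the ddNTP used in that lane) to the list of
    fragment lengths that appear as bands in the lane.
    """
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    bands.sort()  # shortest fragment first, i.e. bottom of the gel first
    return "".join(base for _, base in bands)


# Toy band positions (fragment lengths counted from the primer).
lanes = {"A": [2, 5], "C": [1], "G": [3, 6], "T": [4]}
print(read_gel(lanes))
```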

Automated DNA sequencing

Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA

samples in a single batch (run) in up to 24 runs a day. DNA sequencers carry out capillary

electrophoresis for size separation, detection and recording of dye fluorescence, and data output

as fluorescent peak trace chromatograms. Sequencing reactions by thermocycling, cleanup and

re-suspension in a buffer solution before loading onto the sequencer are performed separately. A

number of commercial and non-commercial software packages can trim low-quality DNA traces

automatically. These programs score the quality of each peak and remove low-quality base peaks

(generally located at the ends of the sequence).
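
End trimming of this kind can be sketched as follows. Real trimming programs typically use more elaborate rules (for example sliding windows); the function name and the threshold of Q20 are assumptions chosen for illustration.

```python
def trim_low_quality(bases, quals, min_q=20):
    """Remove leading and trailing bases whose quality is below `min_q`.

    Low-quality bases inside the read are kept; only the ends are trimmed.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


bases = "NNACGTACGN"
quals = [5, 8, 30, 35, 40, 12, 38, 33, 25, 9]
print(trim_low_quality(bases, quals))
```

In this toy read the two leading bases and the last base are removed, while the single weak base in the middle (quality 12) is kept.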

EXPERIMENT NO.4

AIM: To visualize the macromolecular structure of proteins using RASMOL

RasMol is a computer program written for molecular graphics visualization intended and used

primarily for the depiction and exploration of biological macromolecule structures, such as

those found in the Protein Data Bank. It was originally developed by Roger Sayle in the

early 90s. Maintenance of RasMol, much of the development, and integration of

modifications provided by the community is done at the ARCiB laboratory at Dowling

College. Work on RasMol has been supported in part by grants from the U.S. Department

of Energy, the U.S. National Science Foundation and the U.S. NIH National Institute of

General Medical Sciences. RasMol 2.7.5 runs on a wide range of architectures and operating

systems including Microsoft Windows, Apple Macintosh, UNIX and VMS systems. UNIX

and VMS versions require an 8, 24 or 32 bit colour X Windows display (X11R4 or later).

The X Windows version of RasMol [rasmol2.7.5.exe] provides optional support for a

hardware dials box and accelerated shared memory communication (via the XInput and

MIT-SHM extensions) if available on the current X Server.

The program reads in a molecule coordinate file and interactively displays the molecule on

the screen in a variety of colour schemes and molecule representations. Currently available

representations include depth-cued wireframes, `Dreiding` sticks, spacefilling (CPK) spheres, ball and stick, solid and strand molecular ribbons, atom labels and dot surfaces.
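
The coordinate files that RasMol reads from the Protein Data Bank are fixed-width text, with one ATOM record per atom. The sketch below parses a single record using the column positions of the PDB format; the sample line is built by hand with made-up coordinates, and the function name is an assumption for illustration.

```python
def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record of a PDB file."""
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38: x coordinate
        "y": float(line[38:46]),          # columns 39-46: y coordinate
        "z": float(line[46:54]),          # columns 47-54: z coordinate
    }


# Sample record assembled field by field so the column widths are explicit.
sample = (
    "ATOM  " "    1" " " " N  " " " "VAL" " " "A" "   1" "    "
    "  -0.576" "  19.021" "  16.341"
)
atom = parse_atom_line(sample)
print(atom)
```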

PROCEDURE FOR VISUALISATION:

Search Google for the RasMol 2.7.5 README.



Download the RasMol 2.7.5 Windows installer and save it.

Open the PDB website (www.pdb.org) and enter the PDB ID, or a text search, for the complete structure

file of the protein of interest (haemoglobin in this case).

Download the file entitled "Structure of human deoxy hemoglobin A in complex with

xenon".

Open the structure file with the help of RasMol and analyze its sequence with the help of

the functions available in RasMol.

EXPERIMENT NO.5

AIM: To perform gene structure prediction using Genscan

Genscan is a bioinformatics program. Its main function is to take a DNA sequence and

find the open reading frames (ORFs) that correspond to genes. Genscan was developed by Chris Burge

as part of his doctoral work at Stanford University. This program is not only used to predict genes in a

sequenced set of DNA, it can also be used to determine a specific sequence using measures

of the percentage of C+G content. There are two approaches followed by Genscan for gene

prediction.

Statistical pattern identification- this approach to gene prediction uses general knowledge

about gene structure. Knowledge of gene structure includes the promoter region, the start and end

sequences of introns and exons, etc.

Sequence similarity comparison- this approach is based on similarity and takes advantage of

the fact that if the sequence is similar to the one with which it is being compared, it will

have the same function. But the structure of gene cannot be predicted accurately based on

sequence information alone.



For large scale analysis of gene, the typical strategy is to completely inactivate each gene or over

express it. In each case, however, the resulting phenotype may not be informative. Genscan

uses two types of signal models to model different functional units. A weight matrix

model is used for modeling promoter, polyadenylation signals, transcription initiation and

termination signals. A modified version of the weighted array model is used for modeling

acceptor splice sites. After the prediction of gene structure, its function and expression

level can be investigated. Genscan can also identify disease severity.
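
A weight matrix model of a signal can be sketched as a table of log-odds scores, one column per position of the signal. The example below builds such a matrix from a few toy polyadenylation-like sites and scores candidate windows against it. It illustrates the general technique only, not Genscan's own parameters; the pseudocount, the uniform background of 0.25 and the function names are assumptions.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, window):
    """Sum the per-position log-odds scores for a window of the same length."""
    return sum(col[base] for col, base in zip(matrix, window))


# Toy training sites (illustrative only).
sites = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA"]
matrix = build_weight_matrix(sites)
print(score(matrix, "AATAAA"), score(matrix, "GCGCGC"))
```

A window that resembles the training sites receives a positive score, while an unrelated window receives a negative one.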

PROCEDURE:

Search for Genscan on Google and select genes.mit.edu/GENSCAN.html

Now go to NCBI's homepage and search for chromosome 14 under the genome databases option.

Select the entire sequence or a part of it and paste it under the input option on the Genscan

homepage.

Fill in the entries according to the requirements of experiment and click on RUN.

1) GENSCAN 1.0 ru31-1

EXPERIMENT NO.6

AIM: To perform multiple sequence alignment using the CLUSTALW algorithm

The sensitivity of the commonly used progressive multiple sequence alignment method has been

greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are

assigned to each sequence in a partial alignment in order to downweight near-duplicate

sequences and upweight the most divergent ones. Secondly, amino acid substitution matrices are

varied at different alignment stages according to the divergence of the sequences to be aligned.

Thirdly, residue specific gap penalties and locally reduced gap penalties in hydrophilic regions

encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly,

positions in early alignments where gaps have been opened receive locally reduced gap penalties

to encourage the opening up of new gaps at these positions. These modifications are incorporated

into a new program, CLUSTALW. ClustalW2 is a general purpose multiple sequence alignment

program for DNA or proteins. The basic multiple alignment algorithm consists of three main



stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix

giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance

matrix; 3) the sequences are progressively aligned according to the branching order in the guide

tree.
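
Stage 1 of the procedure above can be sketched with a simple global alignment and a distance defined as one minus the fraction of identical aligned residues. This is a simplified stand-in for ClustalW's pairwise step: the scoring values (match 1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(a, b):
    """1 - fraction of identical residues among aligned (non-gap) columns."""
    al_a, al_b = global_align(a, b)
    pairs = [(x, y) for x, y in zip(al_a, al_b) if x != "-" and y != "-"]
    return 1.0 - sum(x == y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """All-against-all distances, the input for the guide tree in stage 2."""
    return [[distance(a, b) for b in seqs] for a in seqs]


print(distance_matrix(["ACGT", "ACGT", "ACGA"]))
```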

PROCEDURE: On google webpage, search for CLUSTALW, Select the required page and

follow the steps:

Step 1 - Sequence

Sequence Input Window

Three or more sequences to be aligned can be entered directly into this form. Sequences can

be in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot format.

Partially formatted sequences are not accepted. Adding a return to the end of the sequence may

help certain applications understand the input. Note that directly using data from word processors

may yield unpredictable results as hidden/control characters may be present.

Sequence File Upload

A file containing three or more valid sequences in any format (GCG, FASTA, EMBL, GenBank,

PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot) can be uploaded and used as input for the

multiple sequence alignment. Word processor files may yield unpredictable results as

hidden/control characters may be present in the files. It is best to save files with the Unix format

option to avoid hidden Windows characters.

Sequence Type

Indicates if the sequences to align are protein or nucleotide (DNA/RNA).

Type      Abbreviation
Protein   protein
DNA       dna



Default value is: Protein [protein]

Step 2 - Pairwise Alignment Options

Alignment Type

The alignment method used to perform the pairwise alignments used to generate the guide tree.

Alignment Type   Description             Abbreviation
slow             Slow, but accurate      slow
fast             Fast, but approximate   fast

Default value is: slow

Protein Weight Matrix (PW)

Slow pairwise alignment protein sequence comparison matrix series used to score alignment.

Matrix (Protein Only)   Abbreviation
BLOSUM                  blosum
PAM                     pam
Gonnet                  gonnet
ID                      id

Default value is: Gonnet [gonnet]

DNA Weight Matrix (PW)

Slow pairwise alignment nucleotide sequence comparison matrix used to score alignment.


Matrix     Abbreviation
IUB        iub
ClustalW   clustalw

Default value is: IUB [iub]

Gap Open (PW)

Slow pairwise alignment score for the first residue in a gap.

Default value is: 10

Gap Extension (PW)

Slow pairwise alignment score for each additional residue in a gap.

Default value is: 0.1
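
Taking the two definitions above literally, the cost of a gap is the opening score for its first residue plus the extension score for every additional residue. The sketch below follows that reading with the default values quoted in the text; the function name is an assumption, and the exact formula used inside ClustalW may differ in detail.

```python
def gap_penalty(length, gap_open=10.0, gap_ext=0.1):
    """Penalty for one gap of `length` residues (open + extensions)."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_ext


# One long gap is much cheaper than several short ones of the same total length.
print(gap_penalty(3))      # a single gap of three residues
print(3 * gap_penalty(1))  # three separate one-residue gaps
```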

KTUP

Fast pairwise alignment word size used to find matches between the sequences. Decrease for

sensitivity; increase for speed.

Default value is: 1
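
The effect of the word size can be seen by listing the k-tuples that two sequences share. The sketch below is only an illustration of word matching, not ClustalW's fast alignment routine; the function name is an assumption.

```python
def shared_ktuples(seq_a, seq_b, ktup=1):
    """List (position in seq_b, word) for every k-tuple of seq_b found in seq_a."""
    words_a = {seq_a[i:i + ktup] for i in range(len(seq_a) - ktup + 1)}
    hits = []
    for j in range(len(seq_b) - ktup + 1):
        word = seq_b[j:j + ktup]
        if word in words_a:
            hits.append((j, word))
    return hits


print(len(shared_ktuples("ACGTAC", "TTACGG", ktup=1)))  # many short matches
print(shared_ktuples("ACGTAC", "TTACGG", ktup=3))       # few, more specific matches
```

A small word size finds many matches and is therefore more sensitive but slower to process; a larger word size leaves fewer matches to examine.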

Window Length

Fast pairwise alignment window size for joining word matches. Decrease for speed; increase for

sensitivity.

Default value is: 5

Score Type

Fast pairwise alignment score type to output.



Score Type   Description
percent      Percent
absolute     Absolute

Default value is: percent

Top Diags

Fast pairwise alignment number of match regions used to create the pairwise alignment.

Decrease for speed; increase for sensitivity.

Default value is: 5

Pair Gap

Fast pairwise alignment gap penalty for each gap created.

Default value is: 3

Step 3 - Multiple Sequence Alignment Options

Protein Weight Matrix

Multiple alignment protein sequence comparison matrix series used to score the alignment.


Matrix   Abbreviation
BLOSUM   blosum
PAM      pam
Gonnet   gonnet
ID       id



Default value is: Gonnet [gonnet]

DNA Weight Matrix

Multiple alignment nucleotide sequence comparison matrix used to score the alignment.


Matrix     Abbreviation
IUB        iub
ClustalW   clustalw

Default value is: IUB [iub]

Gap Open

Multiple alignment penalty for the first residue in a gap.

Default value is: 10

Gap Extension

Multiple alignment penalty for each additional residue in a gap.

Default value is: 0.20

Gap Distances

Multiple alignment gaps that are closer together than this distance are penalised.

Default value is: 5

No End Gaps

Multiple alignment option that disables the gap separation penalty when scoring gaps at the ends of the alignment.




no False

yes True

Default value is: no [false]

Iteration

Multiple alignment improvement iteration type


none No iteration none

tree Iteration at each step of alignment process tree

alignment Iteration only on final alignment alignment

Default value is: none

Num Iter

Maximum number of iterations to perform

Default value is: 1

Clustering

Clustering type.


NJ Neighbour-joining (Saitou and Nei 1987) NJ

UPGMA UPGMA clustering UPGMA

Default value is: NJ



Output

Format for generated multiple sequence alignment.


Aln w/numbers ClustalW alignment format with base/residue numbering aln1

Aln wo/numbers ClustalW alignment format without base/residue numbering aln2

GCG MSF GCG Multiple Sequence File (MSF) alignment format Gcg

PHYLIP PHYLIP interleaved alignment format Phylip

NEXUS NEXUS alignment format Nexus

NBRF/PIR NBRF or PIR sequence format Pir

GDE GDE sequence format Gde

Pearson/FASTA Pearson or FASTA sequence format Fasta

Default value is: Aln w/numbers [aln1]

Order

The order in which the sequences appear in the final alignment


aligned Determined by the guide tree aligned

input Same order as the input sequences input

Default value is: aligned
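The web-form options described in Steps 2 and 3 have counterparts in the stand-alone ClustalW2 program. The sketch below only assembles a command line from the DNA defaults listed in this write-up; it does not run anything. The flag names are assumed to follow the ClustalW2 command-line help, the input file name sequences.fasta is a placeholder, and build_clustalw2_command is a hypothetical helper.

```python
# Defaults copied from the option descriptions above (DNA input assumed).
DEFAULT_OPTIONS = {
    "PWDNAMATRIX": "IUB",   # DNA Weight Matrix (PW)
    "PWGAPOPEN": 10,        # Gap Open (PW)
    "PWGAPEXT": 0.1,        # Gap Extension (PW)
    "DNAMATRIX": "IUB",     # DNA Weight Matrix
    "GAPOPEN": 10,          # Gap Open
    "GAPEXT": 0.2,          # Gap Extension
    "GAPDIST": 5,           # Gap Distances
    "ITERATION": "NONE",    # Iteration
    "NUMITER": 1,           # Num Iter
    "CLUSTERING": "NJ",     # Clustering
    "OUTORDER": "ALIGNED",  # Order
}


def build_clustalw2_command(infile, seq_type="DNA", **overrides):
    """Return the argument list for a ClustalW2 alignment run.

    Keyword overrides replace the defaults, e.g. clustering="UPGMA".
    """
    options = dict(DEFAULT_OPTIONS)
    options.update({name.upper(): value for name, value in overrides.items()})
    cmd = ["clustalw2", f"-INFILE={infile}", "-ALIGN", f"-TYPE={seq_type}"]
    cmd += [f"-{name}={value}" for name, value in options.items()]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_clustalw2_command("sequences.fasta")))
```

Keeping the defaults in one dictionary makes it easy to record exactly which settings were used for an experiment.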



Step 4 - Submission

Job title

The tool result can be identified by giving it a name. This name will be associated with the results and may appear in some of the graphical representations of the results.

Email Notification

Running a tool is usually an interactive process: the results are delivered directly to the browser when they become available. Depending on the tool and its input parameters, this may take quite a long time. To be notified by email when the job is finished, tick the box "Be notified by email". An email with a link to the results will be sent to the address specified in the corresponding text box. Email notifications require a valid email address.

Email Address

If email notification is requested, a valid Internet email address must be provided. This is not required when running the tool interactively (the results will be delivered to the browser window when they are ready).



CLUSTAL 2.1 multiple sequence alignment

gi|166362739|ref|NM_001992.3|

AGAGACTCTCACTGCACGCCGGAGGGCGCCCTTCCTCGCTCGCGCCCGCG 50gi|133892391|ref|NM_010169.3|

--------------------------------------------------

gi|166362739|ref|NM_001992.3|

CGACCGCGCGCCCCAGTCCCGCCCCGCCCCGCTAACCGCCCCAGACACAG 100gi|133892391|ref|NM_010169.3| ------------------------------GCTA-----

CTCAGAAA--- 12**** *

**** *

gi|166362739|ref|NM_001992.3|

CGCTCGCCGAGGGTCGCTTGGACCCTGATCTTACCCGTGGGCACCCTGCG 150gi|133892391|ref|NM_010169.3| --------GAAG------TAGGC---GA------CGGCGGGCGCC----- 34

** * * * * ** * * ****

**

gi|166362739|ref|NM_001992.3|

CTCTGCCTGCCGCGAAGACCGGCTCCCCGACCCGCAGAAGTCAGGAGAGA 200gi|133892391|ref|NM_010169.3| ---------------GGGCCG-----------

CGC--------------- 43* *** ***

gi|166362739|ref|NM_001992.3|

GGGTGAAGCGGAGCAGCCCGAGGCGGGGCAGCCTCCCGGAGCAGCGCCGC 250

gi|133892391|ref|NM_010169.3|-------------------------GGGCAGCCTT--------------- 53

*********

gi|166362739|ref|NM_001992.3|

GCAGAGCCCGGGACAATGGGGCCGCGGCGGCTGCTGCTGGTGGCCGCCTG 300gi|133892391|ref|NM_010169.3|

---------GGGACAATGGGGCCCCGGCGCTTGCTGATCGTCGCCCTCGG 94************** ***** ***** * **

*** * *

gi|166362739|ref|NM_001992.3|

CTTCAGTCTGTGCGGCCCGCTGTTGTCTGCCCGCACCCGGGCCCGCAGGC 350

gi|133892391|ref|NM_010169.3|CCTCAGCCTGTGCGGTCCCTTGCTGTCTTCCCGCGTCCCTATGAGCCAGC 144

* **** ******** ** ** ***** ***** **** **

gi|166362739|ref|NM_001992.3|

CAGAATCAAAAGCAACAAATGCCACCTTAGATCCCCGGTCATTTCTTCTC 400gi|133892391|ref|NM_010169.3|

CAGAATCAGAGAGGACAGATGCTACGGTGAACCCCCGCTCATTCTTTCTA 194



******** * *** **** ** * * *****

***** ****

gi|166362739|ref|NM_001992.3| AGGAACCCCAATGATAA---

ATATGAACCATTTT------------GGGA 435gi|133892391|ref|NM_010169.3|

AGGAATCCCAGTGAAAATACATTTGAACTGGTCCCCCTGGGGGATGAGGA 244***** **** *** ** ** ***** *

***

gi|166362739|ref|NM_001992.3|GGATGAGGAGAAAAATGAAAGTGGGTTAACTGAATACAGATTAGTCTCCA 485

gi|133892391|ref|NM_010169.3|GGAGGAGGAGAAAAATGAAAGCGTCCTGCTGGAGGGTAGGGCAGTCTACT 294

*** ***************** * * ** **

***** *

gi|166362739|ref|NM_001992.3|

TCAATAAAAGCAGTCCTCTTCAAAAACAACTTCCTGCATTCATCTCAGAA 535gi|133892391|ref|NM_010169.3|

TAAATATAAGCCTCCCTCCTCACACGCCGCCTCCTCCCTTCATCTCCGAG 344* **** **** **** *** * * * **** *

******** **

gi|166362739|ref|NM_001992.3|GATGCCTCCGGATATTTGACCAGCTCCTGGCTGACACTCTTTGTCCCATC 585

gi|133892391|ref|NM_010169.3|GACGCCTCCGGATATCTGACCAGCCCCTGGCTGACGCTCTTCATGCCCTC 394

** ************ ******** ********** *****

* ** **

gi|166362739|ref|NM_001992.3|

TGTGTACACCGGAGTGTTTGTAGTCAGCCTCCCACTAAACATCATGGCCA 635gi|133892391|ref|NM_010169.3|

CGTGTACACGATTGTGTTCATTGTCAGCCTTCCTCTGAACGTCCTGGCCA 444******** ***** * ******** ** ** ***

** ******

gi|166362739|ref|NM_001992.3|TCGTTGTGTTCATCCTGAAAATGAAGGTCAAGAAGCCGGCGGTGGTGTAC 685

gi|133892391|ref|NM_010169.3|TCGCAGTGTTCGTCTTGAGGATGAAGGTCAAGAAGCCGGCCGTGGTGTAC 494

*** ****** ** *** *****************************

gi|166362739|ref|NM_001992.3|

ATGCTGCACCTGGCCACGGCAGATGTGCTGTTTGTGTCTGTGCTCCCCTT 735

gi|133892391|ref|NM_010169.3|ATGCTGCACCTGGCCATGGCCGACGTGCTCTTCGTGTCGGTGCTCCCCTT 544

**************** *** ** ***** ** *****

***********

gi|166362739|ref|NM_001992.3|TAAGATCAGCTATTACTTTTCCGGCAGTGATTGGCAGTTTGGGTCTGAAT 785

gi|133892391|ref|NM_010169.3|CAAGATCAGCTACTACTTCTCCGGCACTGATTGGCAGTTCGGGTCTGGAA 594



*********** ***** ******* ************

******* *

gi|166362739|ref|NM_001992.3|

TGTGTCGCTTCGTCACTGCAGCATTTTACTGTAACATGTACGCCTCTATC 835gi|133892391|ref|NM_010169.3|

TGTGCCGCTTCGCCACCGCAGCGTTTTACGGGAACATGTACGCCTCCATC 644**** ******* *** ***** ****** *

************** ***

gi|166362739|ref|NM_001992.3|TTGCTCATGACAGTCATAAGCATTGACCGGTTTCTGGCTGTGGTGTATCC 885

gi|133892391|ref|NM_010169.3|ATGCTCATGACGGTCATAAGCATTGACCGGTTCCTGGCGGTGGTGTATCC 694

********** ******************** *****

***********

gi|166362739|ref|NM_001992.3|

CATGCAGTCCCTCTCCTGGCGTACTCTGGGAAGGGCTTCCTTCACTTGTC 935gi|133892391|ref|NM_010169.3|

GATCCAGTCCCTGTCCTGGCGCACTCTGGGCCGAGCCAACTTCACTTGCG 744** ******** ******** ******** * **

*********

gi|166362739|ref|NM_001992.3|TGGCCATCTGGGCTTTGGCCATCGCAGGGGTAGTGCCTCTGCTCCTCAAG 985

gi|133892391|ref|NM_010169.3|TGGTCATTTGGGTGATGGCCATCATGGGGGTGGTGCCCCTTCTCCTCAAG 794

*** *** **** ******** ***** ***** **

*********

gi|166362739|ref|NM_001992.3|

GAGCAAACCATCCAGGTGCCCGGGCTCAACATCACTACCTGTCATGATGT 1035gi|133892391|ref|NM_010169.3|

GAGCAGACCACCCGAGTTCCGGGACTCAACATCACCACCTGCCACGACGT 844***** **** ** ** ** ** *********** *****

** ** **

gi|166362739|ref|NM_001992.3|GCTCAATGAAACCCTGCTCGAAGGCTACTATGCCTACTACTTCTCAGCCT 1085

gi|133892391|ref|NM_010169.3|CCTCAGTGAGAACCTGATGCAAGGCTTTTACTCGTACTACTTCTCGGCCT 894

**** *** * **** * ****** ** ************ ****

gi|166362739|ref|NM_001992.3|

TCTCTGCTGTCTTCTTTTTTGTGCCGCTGATCATTTCCACGGTCTGTTAT 1135

gi|133892391|ref|NM_010169.3|TCTCCGCCATCTTCTTTCTTGTGCCGTTGATCGTTTCCACGGTCTGCTAC 944

**** ** ******** ******** *****

************* **

gi|166362739|ref|NM_001992.3|GTGTCTATCATTCGATGTCTTAGCTCTTCCGCAGTTGCCAACCGCAGCAA 1185

gi|133892391|ref|NM_010169.3|ACGTCCATCATCCGGTGCCTGAGCTCCTCCGCGGTTGCCAACCGGAGCAA 994



*** ***** ** ** ** ***** *****

*********** *****

gi|166362739|ref|NM_001992.3|

GAAGTCCCGGGCTTTGTTCCTGTCAGCTGCTGTTTTCTGCATCTTCATCA 1235gi|133892391|ref|NM_010169.3|

GAAGTCGCGGGCTTTGTTCCTGTCTGCCGCGGTGTTCTGCATCTTCATCG 1044****** ***************** ** ** **

***************

gi|166362739|ref|NM_001992.3|TTTGCTTCGGACCCACAAACGTCCTCCTGATTGCGCATTACTCATTCCTT 1285

gi|133892391|ref|NM_010169.3|TCTGCTTTGGGCCCACCAACGTCCTCCTGATTGTGCACTACCTTTTCCTC 1094

* ***** ** ***** **************** *** ***

*****

gi|166362739|ref|NM_001992.3|

TCTCACACTTCCACCACAGAGGCTGCCTACTTTGCCTACCTCCTCTGTGT 1335gi|133892391|ref|NM_010169.3|

TCCGACAGTCCTGGTACAGAGGCAGCCTACTTTGCTTACCTCCTCTGCGT 1144** *** * * ******** ***********

*********** **

gi|166362739|ref|NM_001992.3|CTGTGTCAGCAGCATAAGCTGCTGCATCGACCCCCTAATTTACTATTACG 1385

gi|133892391|ref|NM_010169.3|CTGTGTGAGCAGCGTGAGCTGCTGCATCGATCCGTTGATTTACTACTACG 1194

****** ****** * ************** ** *

******** ****

gi|166362739|ref|NM_001992.3|

CTTCCTCTGAGTGCCAGAGGTACGTCTACAGTATCTTATGCTGCAAAGAA 1435gi|133892391|ref|NM_010169.3|

CCTCCTCCGAGTGCCAGAGGCACCTCTACAGCATCTTGTGCTGCAAAGAA 1244* ***** ************ ** ******* *****

************

gi|166362739|ref|NM_001992.3|AGTTCCGATCCCAGCAGTTATAACAGCAGTGGGCAGTTGATGGCAAGTAA 1485

gi|133892391|ref|NM_010169.3|AGCTCTGATCCCAACAGTTGCAACAGCACCGGCCAGCTGATGCCGAGTAA 1294

** ** ******* ***** ******* ** *** ****** *****

gi|166362739|ref|NM_001992.3|

AATGGATACCTGCTCTAGTAACCTGAATAACAGCATATACAAAAAGCTGT 1535

gi|133892391|ref|NM_010169.3|AATGGATACCTGCTCTAGTCACCTGAATAACAGCATATACAAAAAGCTAT 1344

*******************

**************************** *

gi|166362739|ref|NM_001992.3| TAACTTAGGAAAAGGGACTGCTGGGAGGTTAAA-AAGAAAAGTTTATAAA 1584

gi|133892391|ref|NM_010169.3| TAGCTTAGGGAAAGGG-TTGCTGGAAGGTTCCATGAGAAAAGGTTG-GAA 1392



** ****** ****** ****** ***** * *******

** **

gi|166362739|ref|NM_001992.3| AGTGAATAACCTGAGGATTCTATTAGTCCCCACCCA-

AACTTTATTGA-T 1632gi|133892391|ref|NM_010169.3| AGCCAACAGCG-

GGGAATCCCATTAGTCCCTGCAAAGAACTGTATTTACT 1441** ** * * * * ** * ********* * * ****

**** * *

gi|166362739|ref|NM_001992.3| TCACCTCCTAAAA--CAACAGATGTACGACTTGCATACCTGCTTTTTATG 1680

gi|133892391|ref|NM_010169.3|TCGAAACCTAAAAAACAACCAATATCCGATATGCACGAATACTTCT---- 1487

** ******* **** ** * *** **** *

*** *

gi|166362739|ref|NM_001992.3|

GGAGCTGTCAAGCATGTATTTTTGTCAATTACCAGAAAGATAACAGGAC- 1729gi|133892391|ref|NM_010169.3|

---GCTATCAAGAGTCTAGATTGGATAATTACCAGCAAGGTGACGGGAAC 1534*** ***** * ** ** * ********* *** *

** ***

gi|166362739|ref|NM_001992.3|-GAGATGACGGTGTTATTCCAAGGGAATATTGCCAATGCTACAGTAATAA 1778

gi|133892391|ref|NM_010169.3| GGAAATAAAGGTGT----CCAG-----TGTTGCTAGTGCTATGATAGTAA 1575

** ** * ***** *** * **** * *****

** ***

gi|166362739|ref|NM_001992.3|

ATGAATGTCACTTCTGGATATAGCTAGGTGACATATACATACTTACATGT 1828gi|133892391|ref|NM_010169.3| CTGGATGTCACTTCTT-ATATATCTAGGTGAC---------

TTTA----- 1610** *********** ***** *********

***

gi|166362739|ref|NM_001992.3| GTGTATATGTAGATG-TATGCACACACATATATTATTTGCAGTGCAGTAT 1877

gi|133892391|ref|NM_010169.3| ----ATATATAGATGGTATGCACACAC-----TCATTTGTCATGCAGGAG 1651

**** ****** *********** * ********** *

gi|166362739|ref|NM_001992.3|

AGAATAGGCACTTTAAAACACTCTTTCCCCGCACCCCAGCAATTA---TG 1924

gi|133892391|ref|NM_010169.3| GGAATCTGCACTTTGACACA-TTTTTGTTTATTCCCTGGCCGTTACTATG 1700

**** ******* * *** * *** *** **

*** **

gi|166362739|ref|NM_001992.3|AAAATAATCTCTGATTCCCTGATTTAATATGCAAAGTCTAGGTTGGTAGA 1974

gi|133892391|ref|NM_010169.3| GAAATAATCT--GATTCTCTGACTTAATAAACAAAGTCTGAGTTGGTGGG 1748



********* ***** **** ****** ********

****** *

gi|166362739|ref|NM_001992.3|

GTTTAGCCCTGAACATTTCATGGTGTTCATCAACAGTGAGAGACTCCATA 2024gi|133892391|ref|NM_010169.3| TGTTAGCACTGGGCAGCTGGAGATCCTAAT-

GATAGGGGAGGAGTCCGTA 1797***** *** ** * * * * ** * ** *

** *** **

gi|166362739|ref|NM_001992.3| GTTTGGGCTTG-TACCACTTTTGCAAATAAGTGTATTTTGAAATTGTTTG 2073

gi|133892391|ref|NM_010169.3| GTTTAGACTTAACACAGCTTTTGCCTATA--TTTTTTTTCAAATTATTTG 1845

**** * *** ** ******* *** * * ****

***** ****

gi|166362739|ref|NM_001992.3|

ACGGCAAGGTTTAAGTTATTAAGAGGTAAGACTTAGTACTATCTGTGC-G 2122gi|133892391|ref|NM_010169.3| ATAATAATGGTTA-GTGATGGAAGGATGAGAC--

AGTATTACCTGTGTAG 1892* ** * *** ** ** * * * **** **** **

***** *

gi|166362739|ref|NM_001992.3|TAGAAGTTCTAGTGTTTTCAATTTTAAACATATCCAAGTTTGAATTCCTA 2172

gi|133892391|ref|NM_010169.3|GGGAAGCTCTAATACTTTTCATCTTGAACATACCGTAGTTTTAA------ 1936

**** **** * *** ** ** ****** * *****

**

gi|166362739|ref|NM_001992.3|

AAATTATGGAAACAGATGAAAAGCCTCTGTTTTGATATGGGTAGTATTTT 2222gi|133892391|ref|NM_010169.3| GAATTATCAAGGCTGTTGGAAAACCC--

GTTTTGATATGGGTAGCATTTT 1984****** * * * ** *** **

**************** *****

gi|166362739|ref|NM_001992.3| TT---------ACATTTTACACACTGTACACATAAGCCAAAACTGAGCAT 2263

gi|133892391|ref|NM_010169.3|TTTTTTAACTTGCAATTTACTTACTGAATACATGGACCAAGACTGAGCAT 2034

** ** ***** **** * **** *************

gi|166362739|ref|NM_001992.3| AAGTCCT-

CTAGTGAATGTAGGCTGGCTTTCAGAGTAGGCTATTCCTGAG 2312

gi|133892391|ref|NM_010169.3| AAGACTCACCAG-GACTGTAATAAACCTTACAAAGCAG-CCAAGCCT--- 2079

*** * * ** ** **** *** ** ** ** * *

***

gi|166362739|ref|NM_001992.3|AGCTGCATGTGTCCGCCCCCGATGGAGGACTCCAGGCAGCAGACACATGC 2362

gi|133892391|ref|NM_010169.3| AGACACAGCCATCTGC-----ATGGAGGCCTCTGAGCACCAGGTACAT-- 2122



** ** ** ** ******* *** *** ***

****

gi|166362739|ref|NM_001992.3|

CAGGGCCATGTCAGACACAGATTGGCCAGAAACCTTCCTGCTGAGCCTCA 2412gi|133892391|ref|NM_010169.3| CACACCCCT------------TCGGCTATG---

CCTCCCAGAGAGC---- 2153** ** * * *** * * ***

****

gi|166362739|ref|NM_001992.3|CAGCAGTGAGACTGGGGCCACTACATTTGCTCCATCCTCCTGGGATT--- 2459

gi|133892391|ref|NM_010169.3| -AGAGATG-GATGGGAAGCACCAGGCCCACCCCATCCTGCTAGGATTCTC 2201

** ** ** ** *** * * ******* **

*****

gi|166362739|ref|NM_001992.3|

---GGCTGTGAACTGATCATGTTTATGAGAAACTGGCAAAGCAGAATGTG 2506gi|133892391|ref|NM_010169.3|

ATTAGCTGTGAGCTGACTGTGTCTTTTAGAAATTGGCAAGGTAAGGTATG 2251******* **** *** * * ***** ****** *

* * **

gi|166362739|ref|NM_001992.3|ATATCCTAGGAGGTAATGACCATGAAAGACTTCTCTACCCATCTTAAAAA 2556

gi|133892391|ref|NM_010169.3|CCATCTTGGGAGGCAGTAACTATGAAAGACT------------------- 2282

*** * ***** * * ** **********

gi|166362739|ref|NM_001992.3|CAACGAAAGAAGGCATGGACTTCTGGATGCCCATCCACTGGGTGTAAACA 2606

gi|133892391|ref|NM_010169.3| -GACGAGAGGAGAAA-------------------------GGTGTGTTTA 2306

**** ** ** ****** *

gi|166362739|ref|NM_001992.3|

CATCTAGTAGTTGTTCTGAAATGTCAGTTCTGATATGGAAGCACCCATT- 2655gi|133892391|ref|NM_010169.3|

CATCCAGTAGCTGTCCTGCAAGGCTGGCCCTTGCACAGACAGACACACCC 2356**** ***** *** *** ** * * ** * **

** **

gi|166362739|ref|NM_001992.3| ATGCGCTGTGGCCACTCCAATAGGTGCTGAG---TGTACAGAGT---GGA 2699

gi|133892391|ref|NM_010169.3|

ACATGCCCTGGTCACACTGTTGGATAGTGGGCCATAGACTGACTATAGGA 2406* ** *** *** * * * * ** * * ** **

* ***

gi|166362739|ref|NM_001992.3| ATAAGACAGAGACCTGCCCTCAA--

GAGCAAAGTAGA------------- 2734gi|133892391|ref|NM_010169.3|

GAATAACCGAGTCCTGTCCTTACTCAGGCAACGCAGAGAGCTGGCATGTG 2456* ** *** **** *** * **** * ***



gi|166362739|ref|NM_001992.3| --------TCATGCATAGAG----TGT-----

GATGTATGTGTAATAAAT 2767

gi|133892391|ref|NM_010169.3|GTCAGCTATGATGCACATAGAACTTGTCTTCAGCTGGATGTG-ACCAAGT 2505

* ***** * ** *** * ** ****** ** *

gi|166362739|ref|NM_001992.3|

ATGTTTCACACAAACAAGGCCTGTCAGCTAAAGAAGTTTGAACATTTGGG 2817

gi|133892391|ref|NM_010169.3|

GTATTTCACATAAGCAAGGCCTATCAGCTAAACTGCTTTGCATATCTGAG 2555* ******* ** ******** ********* **** *

** ** *

gi|166362739|ref|NM_001992.3|

TTACTATTTCTTGTGGTTATAACTTAATGAAAACAATGCAGTACAGGACA 2867

gi|133892391|ref|NM_010169.3| TTTCTGCTTCCAGTAGCTATAGATTAG-GATAAAAACACAGTATAAGATG 2604

** ** *** ** * **** *** ** ** ******* * **

gi|166362739|ref|NM_001992.3| TATATTTTTTAAA-ATAAGTCT---GATTTA----

ATTGGGCACTATTTA 2909

gi|133892391|ref|NM_010169.3|

TATATTTTTAATACATATGCCCTTCAGCCTACAAAATTACACACTATTTA 2654********* * * *** * * ** ***

*********

gi|166362739|ref|NM_001992.3|

TTTACAAATGTTTTGCTCAATAGATTGCTCAAATCAGGTTTTCTTTTAAG 2959

gi|133892391|ref|NM_010169.3| TTTACAAATGTTTT-TTCAA-AAATTACTCAAATCAG--------CCAGG 2694

************** **** * *** *********** *

gi|166362739|ref|NM_001992.3|

AATCAATCATGTCAGTCTGCTTAGAAATAACAGAAGAAAATAGAATTGAC 3009

gi|133892391|ref|NM_010169.3| CAT----TATGGTATACACCTT-----

TAATCCCAGAACTTGGGA--GGC 2733** *** * * *** *** **** *

* * * *

gi|166362739|ref|NM_001992.3|ATTGAAATCTAGGAAAATTATTCTATAATTTCCATTTACTTAAGACTTAA 3059

gi|133892391|ref|NM_010169.3| A--GAGG--CAGGCAGATC-TTAAACAATTT---TTTTTTTAAGAAACAA 2775

* ** *** * ** ** * ***** ***

****** **

gi|166362739|ref|NM_001992.3|

TGAGACTTTAAAAGCATTTTTTAACCTCCTAAGTATCAAGTATAGAAAAT 3109

gi|133892391|ref|NM_010169.3| GCAAACACAAAAAG----TTTTA----CTTAAGT-

CCAA----------- 2805* ** ***** ***** * ***** ***

gi|166362739|ref|NM_001992.3|

CTTCATGGAATTCACAAAGTAATTTGGAAATTAGGTTGAAACATATCTCT 3159



gi|133892391|ref|NM_010169.3| TTTTAAGAAATATATAGGTCAGTTTGG---

TTA----------------- 2835

** * * *** * * * ***** ***

gi|166362739|ref|NM_001992.3|TATCTTACGAAAAAATGGTAGCATTTTAAACAAAATAGAAAGTTGCAAGG 3209

gi|133892391|ref|NM_010169.3| -----------AAAATAATAGTA------ATGAA--AGGAAATTTCA--- 2863

***** *** * * ** ** **

** **

gi|166362739|ref|NM_001992.3|

CAAATGTTTATTTAAAAGAGCAGGCCAGGCGCGGTGGCTCACGCCTGTAA 3259gi|133892391|ref|NM_010169.3|

-------TTGATTGAAA----------------------------TTTAT 2878

** ** ***

* **

gi|166362739|ref|NM_001992.3|TCCCAGCACTTTGGGAGGCTGAGGCGGGTGGATCACGAGGTCAGGAGATC 3309

gi|133892391|ref|NM_010169.3| TCT--GTATTTT--------------------TCTTGAGTT------ATT 2900

** * * *** ** *** *

**

gi|166362739|ref|NM_001992.3|

GAGACCATCCTGGCTAACACGGTGAAACCCGTCTCTACTAAAAATGCAAA 3359gi|133892391|ref|NM_010169.3| GAGATTATTT-----------GTAAAGC--ATTTTT------

AATGCCAC 2931

**** ** ** ** * * * *

***** *

gi|166362739|ref|NM_001992.3|AAAAATTAGCCGGGCGTGGTGGCAGGCACCTGTAGTCCCAGCTACTCGGG 3409

gi|133892391|ref|NM_010169.3| AGTGACTA-------------ACAAGCATATAAAATCTTCA-TAC----- 2962

* * ** ** *** * * **

***

gi|166362739|ref|NM_001992.3|

AGGCTGAGGCAGGAGACTGGCGTGAACCCAGGAGGCGGACCTTGTAGTGA 3459gi|133892391|ref|NM_010169.3| ---CTTTGACAAAA---

TAATTTGAA-------------------AATTA 2987** * ** * * ****

* * *

gi|166362739|ref|NM_001992.3|

GCCGAGATCGCGCCACTGTGCTCCAGCCTGGGCAACAGAGCAAGACTCCA 3509gi|133892391|ref|NM_010169.3| ATTTAAAACATATCCTTTTTCT--------GATGAAAAAATATGTTGGCA 3029

* * * * * * ** * * * * *

* **

gi|166362739|ref|NM_001992.3| TCTCAAA-

AAATAAAAATAAATAAAAAATAAAAAAATAAAAGAGCAAACT 3558gi|133892391|ref|NM_010169.3| TTTTAAGCAAATAAGAGTAGA--

AAGGTTGTTTATTTAAGAGAACAAAGT 3077



* * ** ****** * ** * ** * * ***

*** **** *

gi|166362739|ref|NM_001992.3|

ATTTCCAAATACCATAGAATAACTTACATAAAAGTAATATAACTGTATTG 3608gi|133892391|ref|NM_010169.3|

ATTTCCAAATACTGTAGAGTCGCTTCCACGAAAGTCCTATGGTTGTATGG 3127************ **** * *** ** ***** ***

***** *

gi|166362739|ref|NM_001992.3|TAAGTAGAAGCTAGCACTGGTTTTATTAATTTAGTGACTATTCATTTTAT 3658

gi|133892391|ref|NM_010169.3| TTAAC-----TTGGTTCCGGTGTT-----------GGCTG--------AT 3153

* * * * * *** ** * **

**

gi|166362739|ref|NM_001992.3|

CTAAATCAGTGAAGATTTACTGTCATTGTTTATTAGTCTGTATATATTAA 3708gi|133892391|ref|NM_010169.3| CTCAATTACTGA---CTCCCTGTC-CCGTGT-----

TCTGTCTGTGACTT 3194** *** * *** * ***** ** * *****

* *

gi|166362739|ref|NM_001992.3| AATATGA-TATCATTAATGTACTTACAAAATAGTATGTCACTGTTTTTAT 3757

gi|133892391|ref|NM_010169.3|AATGTAACTGTTATCACCGCGCTTGTGACCTTTTACGTCATTGTTTT-GT 3243

*** * * * * ** * * *** * * ** ****

****** *

gi|166362739|ref|NM_001992.3| GTTCA-----

TTCTTAAAAACATAACCTGTATTAATAAATGTGAACATTT 3802gi|133892391|ref|NM_010169.3| GTTCACCCTCTTTTTTAAAAAAAAA--TATATTAATAAAC-

TAAAACCAT 3290***** ** ** **** * ** * ********** *

** *

gi|166362739|ref|NM_001992.3|GCTTGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 3847

gi|133892391|ref|NM_010169.3|GCTTGG--------------------------------------- 3296

******



RESULTS:

Result files

Input Sequences

clustalw2-I20110331-115850-0033-1317260-oy.input

Tool Output

clustalw2-I20110331-115850-0033-1317260-oy.output

Alignments in CLUSTALW format

clustalw2-I20110331-115850-0033-1317260-oy.clustalw

Guide Tree

clustalw2-I20110331-115850-0033-1317260-oy.dnd

Scores Table

SeqA Name Length SeqB Name Length Score

1 gi|166362739|ref|NM_001992.3| 3847 2 gi|133892391|ref|NM_010169.3| 3296 70.0
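The score in the table is a percent identity for the pair of sequences. As a rough illustration of how such a figure can be derived from two aligned rows, the sketch below counts identical residues over the columns where neither row has a gap. This is an assumption about the calculation, not a confirmed description of ClustalW2, and percent_identity is a hypothetical helper shown on a toy pair rather than on the alignment above.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over columns with no gap in either row."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy aligned pair: 4 identical residues out of 5 compared columns.
    print(percent_identity("ACGT-A", "ACCTTA"))
```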



EXPERIMENT NO.7

AIM: To perform pairwise sequence alignment of two retrieved sequences using BLAST

In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable, or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms that produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences).

Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. In bioinformatics, local alignment is mainly performed using the Basic Local Alignment Search Tool, or BLAST. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold. BLAST is one of the most widely used bioinformatics programs[2], because it addresses a fundamental problem and its algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster. Input sequences for BLAST are in FASTA format or GenBank format.
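The difference between global and local alignment can be made concrete with two small dynamic-programming routines. This is a minimal sketch, assuming a simple scoring scheme (match +2, mismatch -1, linear gap -2) chosen only for illustration. The function names are hypothetical, and BLAST itself uses a faster heuristic rather than this exhaustive calculation.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of sub-sequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    a, b = "TTTACGTTT", "GGGACGGGG"
    print("local :", local_alignment_score(a, b))
    print("global:", global_alignment_score(a, b))
```

For two sequences that share only a short region, the local score rewards that region alone, while the global score is pulled down by the mismatched flanks.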



BLAST output can be delivered in a variety of formats, including HTML, plain text, and XML. On NCBI's web page, the default output format is HTML. When a BLAST search is run on NCBI, the results are presented as a graphical overview of the hits found, a table of sequence identifiers for the hits with their scoring data, and the alignments between the sequence of interest and each hit, with the corresponding BLAST scores.

Using a heuristic method, BLAST finds homologous sequences not by comparing either sequence in its entirety, but by locating short matches between the two sequences. This process of finding initial words is called seeding. Only after this first match does BLAST begin to make local alignments. Sets of common letters, known as words, are therefore central to the search. The heuristic algorithm of BLAST locates all common words between the sequence of interest and the hit sequence, or sequences, from the database. These results are then used to build an alignment. After the words for the sequence of interest have been made, neighborhood words are also assembled. These words must have a score of at least the threshold T when compared using a scoring matrix. The threshold score T determines whether a particular word will be included in the alignment or not. Once seeding has been conducted, the alignment, which is only 3 residues long (the word length classically used for protein queries; nucleotide searches use longer words), is extended in both directions by the algorithm used by BLAST. Each extension changes the score of the alignment by either increasing or decreasing it. Should this score be higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, should this score be lower than this pre-determined T, the alignment will cease to extend, preventing areas of poor alignment from being included in the BLAST results.
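The seeding and extension steps described above can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches only (no neighborhood words or threshold T), a plain match/mismatch score instead of a scoring matrix, and stops extending once the score falls a fixed amount (x_drop, an illustrative parameter) below the best seen. The names find_seeds and extend_seed are hypothetical.

```python
def find_seeds(query, subject, word_size=3):
    """Return (i, j) positions where the same word starts in query and subject."""
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in words.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Returns (best_score, (query_start, query_end)) for the best extension.
    """
    score = best = word_size * match
    right = 0
    k = 0
    # Extend to the right of the seed.
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= x_drop:
            break
    score = best
    left = 0
    k = 0
    # Extend to the left of the seed.
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        same = query[i - k - 1] == subject[j - k - 1]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= x_drop:
            break
    return best, (i - left, i + word_size + right)


if __name__ == "__main__":
    q, s = "GGACGTACC", "TTACGTATT"
    for i, j in find_seeds(q, s):
        print((i, j), extend_seed(q, s, i, j))
```

Each seed grows only while the surrounding residues keep the score up, which is how poorly matching flanks are kept out of the reported alignment.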

PROCEDURE:

Search for BLAST on the Google homepage and open http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome

Select the BLAST type you want to perform, for instance select nucleotide blast

Submit the query sequence either in FASTA format or as an NCBI accession number

Select the database to be searched

Click on BLAST
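Since the query must be pasted in FASTA format, a small helper for producing a FASTA record may be useful. The sketch below is illustrative: to_fasta is a hypothetical function, the 60-character line width is a common convention rather than a requirement, and the example sequence is the first 50 bases of NM_001992.3 as shown in the ClustalW alignment of the previous experiment.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as a FASTA record (header line plus wrapped sequence)."""
    header = f">{identifier} {description}".rstrip()
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    fragment = "AGAGACTCTCACTGCACGCCGGAGGGCGCCCTTCCTCGCTCGCGCCCGCG"
    print(to_fasta("NM_001992.3", fragment, description="first 50 bases"), end="")
```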

