
Post on 14-Dec-2015


EMBOSS – an application suite for Bioinformatics

Shahid Manzoor

Adnan Niazi

SLU Global Bioinformatics Centre

E – European

M – Molecular

B – Biology

O – Open

S – Software

S – Suite

All Information

EMBOSS info at http://emboss.sourceforge.net/.

wEMBOSS info at http://wemboss.sourceforge.net/.

E-mail martin.norling@slu.se to get a username and password for wEMBOSS at http://ebiokit.hgen.slu.se/.

What is EMBOSS

Open Source molecular biology analysis package.

Handles a variety of common file formats.

Provides libraries for easy development.

Software licensed under GPL and LGPL.

The wEMBOSS web interface was developed by Martin Sarachu and Marc Colet.

Available at http://emboss.sourceforge.net

Features of EMBOSS

A comprehensive set of sequence analysis programs.

All sequence formats and many alignment and structural formats are handled.

It runs on practically every UNIX you can think of (and likely some that you can't), plus Windows and OS X.

Each application has the same style of interface, so master one and you've mastered them all.

Uses for EMBOSS

Sequence alignment.

Protein motif identification (including domain analysis).

Nucleotide sequence pattern analysis (for example to identify CpG islands or repeats).

Presentation tools for publications.

Programs in EMBOSS

Many small and large programs in the package (>140).

All programs share a common look and feel.

Easy to run from the command line.

Retrieval of sequence data from the web.

The One Argument

-help

The -help argument displays a short help text for any EMBOSS program.

The One Command

wossname

wossname searches the short descriptions of the other programs for keywords.

Large collection of gene and protein analysis tools

Sequence retrieval

Alignments

Primer design

Restriction Mapping

Protein domain searching

Translation

[Workflow diagram: DNA sequence 1 and DNA sequence 2 → dotplot and translation → protein sequence 1 and protein sequence 2 → local/global alignment, multiple sequence alignment, motif and domain searching, physico-chemical properties]

Dotplots

>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA

>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG

For an exact match:

Unix% dottup SEQ1.fasta SEQ2.fasta -wordsize 10 &

For a similarity match:

Unix% dotmatcher SEQ1.fasta SEQ2.fasta -windowsize 10 -threshold 17 &

Dotplots …

Identity Matrix

     A    T    G    C
A    5   -4   -4   -4
T   -4    5   -4   -4
G   -4   -4    5   -4
C   -4   -4   -4    5

Dotplots …

Window Size is the number of bases in a sliding window that is moved along each sequence and compared to generate a single data point on the plot. Window size must be an odd number.

Mismatch Limit determines how similar the two sequences in a window must be to "match". For example, if the window size is 9 and the mismatch limit is 2, then up to 2 mismatches in a 9-base window will still be classified as a match.

Each pair of windows is scored with the identity matrix (match = 5, mismatch = -4).

CCTCCTTTGG
CCTCCTTTGG
Per-position scores: 5 5 5 5 5 5 5 5 5 5; Score = 50

CCTCCTTTGG
CCTCCCTTAG
Per-position scores: 5 5 5 5 5 -4 5 5 -4 5; Score = 32

The two mismatches fall in codons that still encode the same amino acids: CCT and CCC both encode Pro, and TTG and TTA both encode Leu.
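The window scores above can be reproduced with a few lines of Python. This is only a sketch of the scoring idea, not the dotmatcher implementation; the function names are made up for illustration, and treating a score at or above the threshold as a dot is an assumption.

```python
# Identity scoring as in the matrix above: +5 for a match, -4 for a mismatch.
MATCH, MISMATCH = 5, -4


def window_score(window_a: str, window_b: str) -> int:
    """Sum the per-position identity scores of two equal-length windows."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(MATCH if a == b else MISMATCH for a, b in zip(window_a, window_b))


def is_dot(window_a: str, window_b: str, threshold: int) -> bool:
    """Assumption: a dot is drawn when the window score reaches the threshold."""
    return window_score(window_a, window_b) >= threshold


print(window_score("CCTCCTTTGG", "CCTCCTTTGG"))  # 50
print(window_score("CCTCCTTTGG", "CCTCCCTTAG"))  # 32
print(is_dot("CCTCCTTTGG", "CCTCCCTTAG", threshold=17))  # True
```

With a threshold of 17, both example windows would count as matches, even though the second pair differs at two positions.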

Dotplots

A dot plot is a simple graphical representation of identical residues between two sequences.

The X axis represents the first sequence (PHO5).

The Y axis represents the second sequence (PHO3).

A dot is plotted for each match between two residues of the sequences.

Diagonal lines reveal regions of identity between the two sequences.

Dotplots …

The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.

Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
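A word-based dot plot can be sketched in Python as follows. This is a conceptual illustration, not the dottup source code; the function name is hypothetical, and the two short sequences from the Plotorf slide stand in for PHO5 and PHO3, which are not reproduced in this transcript.

```python
def word_matches(seq_x: str, seq_y: str, word_size: int) -> list[tuple[int, int]]:
    """Return 0-based (x, y) start positions where both sequences share an identical word."""
    # Index every word of seq_y by its start positions.
    index: dict[str, list[int]] = {}
    for y in range(len(seq_y) - word_size + 1):
        index.setdefault(seq_y[y:y + word_size], []).append(y)
    # Look up every word of seq_x in that index.
    hits = []
    for x in range(len(seq_x) - word_size + 1):
        for y in index.get(seq_x[x:x + word_size], []):
            hits.append((x, y))
    return hits


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
SEQ2 = "ATGGCTCCTCCCTTAGAATCTTAG"

# Larger words remove short chance matches and leave only the longer diagonals.
for size in (4, 6, 8):
    print(size, len(word_matches(SEQ1, SEQ2, size)))
```

Each hit is the start of one word-length diagonal segment; increasing the word size can only keep or reduce the number of hits.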

Detecting repeats with a dot plot

Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.

The main diagonal is completely marked (by definition, since the sequence is identical to itself).

Repeats appear as segments of lines parallel to the diagonal.
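The self-comparison idea can be sketched as below: every word is compared against the same sequence, the main diagonal is skipped, and the remaining hits are repeats. The function name is hypothetical and the SEQ1 sequence from the Plotorf slide is used only as sample input.

```python
def self_repeats(seq: str, word_size: int) -> list[tuple[int, int]]:
    """Return 0-based (i, j) pairs, i < j, where the same word starts at both positions."""
    positions: dict[str, list[int]] = {}
    for i in range(len(seq) - word_size + 1):
        positions.setdefault(seq[i:i + word_size], []).append(i)
    pairs = []
    for starts in positions.values():
        # Each pair of distinct start positions is a dot off the main diagonal.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"

for i, j in self_repeats(SEQ1, 5):
    # j - i is the distance of the repeat from the main diagonal.
    print(i, j, SEQ1[i:i + 5], "offset", j - i)
```

In SEQ1 the word CTCCT occurs at positions 19 and 22, which reflects the tandem CCT CCT codons; in a plot this shows up as a short line three positions away from the diagonal.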

Plotorf

>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA

>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG

Unix% plotorf SEQ1.fasta -stop TAA,TAG -out GA.plot &

Unix% getorf SEQ1.fasta -minsize 5 -table 0 -find 1 -out GA.getorf &

Frame 1, Frame 2, Frame 3 (forward strand)

ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA

TACCCAGCACTTCTCTTACGAGGAGGAAACCTTAGAATT

Frame -1, Frame -2, Frame -3 (reverse strand)

Start and stop codons are located according to the instructions given to the program, and the area between each start and stop codon is marked as a potential open reading frame.
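The two strands and six reading frames shown above can be derived with a short Python sketch. The helper names are hypothetical, and the numbering of the negative frames is an assumption made for illustration; tools differ in how they number reverse frames.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """The reverse strand read in its own 5' to 3' direction."""
    return complement(seq)[::-1]


def split_codons(seq: str) -> list[str]:
    """Cut a sequence into complete codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict[int, list[str]]:
    """Frames 1..3 on the forward strand; -1..-3 on the reverse complement (assumed numbering)."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = split_codons(seq[offset:])
        frames[-(offset + 1)] = split_codons(reverse[offset:])
    return frames


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
print(complement(SEQ1))  # the second strand as drawn on the slide
for frame, codons in sorted(six_frames(SEQ1).items()):
    print(frame, " ".join(codons))
```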


Indication of full coding sequence?

Alternative splice form?

Using getorf:

>_1 [17 - 37]
MLLLWNL

>_2 [1 - 36]
MGREENAPPLES*

M = start methionine; * = stop codon
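The two open reading frames reported above can be reproduced with a small Python sketch. It is not the getorf algorithm itself: it scans only the forward strand, uses the standard genetic code, treats ATG as the only start codon, and lets an ORF run to the end of the sequence when no stop codon follows. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(seq: str) -> str:
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def find_orfs(seq: str, min_size: int = 5) -> list[tuple[int, int, str]]:
    """Return (begin, end, protein) with 1-based inclusive coordinates, forward strand only."""
    orfs = []
    for offset in range(3):
        start = None
        end = offset
        for pos in range(offset, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOPS:
                orfs.append((start, pos))  # region ends just before the stop codon
                start = None
            end = pos + 3
        if start is not None:
            orfs.append((start, end))  # open ORF runs to the last complete codon
    return sorted(
        (s + 1, e, translate(seq[s:e])) for s, e in orfs if e - s >= min_size
    )


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
for begin, end, protein in find_orfs(SEQ1):
    print(f"[{begin} - {end}] {protein}")
```

For SEQ1 this prints the regions 1 to 36 and 17 to 37, matching the coordinates in the getorf output; the stop codon itself is not included in the reported region.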

Unix% transeq SEQ1.fasta -frame 1 -table 0 -sbegin 4 -send 36 -out GA.fasta &

>GA.fasta
GREENAPPLES
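Translating a sub-range of SEQ1 can be checked with a short Python sketch. It assumes the standard genetic code and 1-based inclusive coordinates; the function name is hypothetical. It also shows why the end coordinate matters: a range ending at base 33 stops one codon short of the final S.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_region(seq: str, begin: int, end: int) -> str:
    """Translate bases begin..end (1-based, inclusive) in the frame starting at begin."""
    region = seq[begin - 1:end]
    return "".join(CODON_TABLE[region[i:i + 3]] for i in range(0, len(region) - 2, 3))


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
print(translate_region(SEQ1, 4, 36))  # GREENAPPLES
print(translate_region(SEQ1, 4, 33))  # GREENAPPLE
print(translate_region(SEQ1, 1, 39))  # MGREENAPPLES*
```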

Alignments

>GA.fasta
GREENAPPLES

>A.fasta
APPLES

For a global alignment:

Unix% needle GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &

For a local alignment:

Unix% water GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &

Alignments …

To align two or more sequences in a biologically significant way.

Gap penalty = 10; Extension penalty = 0.5

Local (water):

GREENAPPLES
     APPLES

Global (needle):

GREENAPPLES
-----APPLES
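The local alignment idea behind water can be sketched with a minimal Smith-Waterman implementation. This is a simplification under stated assumptions: it uses a plain match/mismatch score and a single linear gap penalty, not the EPAM250 matrix or the separate gap open and extension penalties used above. The function name and score values are illustrative only.

```python
def local_align(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("GREENAPPLES", "APPLES"))  # (12, 'APPLES', 'APPLES')
```

A global aligner such as needle differs mainly in that scores are allowed to go negative and the traceback covers both sequences from end to end.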

GREENAPPLES
     APPLES

Looks like the "apples" motif may be part of a larger domain.

Physico-chemical properties

Pattern searching

Physico-chemical properties

Isoelectric point:

Unix% iep GA.fasta -plot -step 0.5 -out GA.IEP &

General properties:

Unix% pepinfo GA.fasta -hwindow 8 -generalplot -hydropathyplot &
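The effect of a hydropathy window such as -hwindow 8 can be illustrated with a sliding-window average. This sketch assumes the Kyte-Doolittle scale and a simple unweighted mean; it is not taken from the pepinfo source, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this illustration).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 8) -> list[float]:
    """Average hydropathy of every full window along the protein."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


for start, value in enumerate(hydropathy_profile("GREENAPPLES"), start=1):
    print(start, round(value, 3))
```

An 11-residue protein yields only four windows of size 8, so the resulting plot is short; every window here averages below zero, and the values rise towards the C-terminal APPLES end.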

Physico-chemical properties

[Venn diagram of the amino acids (D, Y, F, W, H, K, R, E, Q, N, M, A, G, C, S, P, I, V, L, T) grouped into overlapping property classes: aliphatic, aromatic, hydrophobic, tiny, small, charged, positive, polar]

The pepinfo graph of properties is based on this diagram.

Physico-chemical properties

non-polar region with small residues

polar region to one side of non-charged region

Pattern searching

>GA.fasta
GREENAPPLES

>GL.fasta
GREENLEAVES

>RA.fasta
REDAPPLES

>RL.fasta
REDLEAVES

GREENAPPL---ES
-RE-DAPPL---ES
GREEN---LEAVES
-RE-D---LEAVES

[G](0,1)-R-[E](1,2)-[ND]-x(3)-L-x(3)-E-S
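One way to sanity-check such a pattern is to convert it to a regular expression and run it against the four sequences. The converter below is a hypothetical helper that handles only the constructs used here (single residues, bracketed sets, x, and repeat counts); it is not the fuzzpro parser. Read literally, x(3) requires exactly three residues on each side of the L, which none of the four sequences satisfies; the variant with x(0,3) is an assumed relaxation that reflects the gaps in the alignment and is not part of the slides.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        found = ELEMENT.fullmatch(element)
        if found is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = found.groups()
        if residue in ("x", "X"):
            residue = "."  # x stands for any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
    return "".join(parts)


SEQUENCES = ["GREENAPPLES", "GREENLEAVES", "REDAPPLES", "REDLEAVES"]
AS_WRITTEN = "[G](0,1)-R-[E](1,2)-[ND]-x(3)-L-x(3)-E-S"
RELAXED = "[G](0,1)-R-[E](1,2)-[ND]-x(0,3)-L-x(0,3)-E-S"  # assumed variant

for label, pattern in (("as written", AS_WRITTEN), ("relaxed", RELAXED)):
    regex = re.compile(pattern_to_regex(pattern))
    matched = [s for s in SEQUENCES if regex.fullmatch(s)]
    print(label, regex.pattern, matched)
```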

Pattern searching

Search a protein database:

Unix% fuzzpro sptr:* pattern.fruit -mismatch 0 -out GA.fuzzpro &

pattern.fruit:

[G](0,1)-[R]-[E](1,2)-[ND]-x(3)-[L]-x(3)-[E]-[S]

Nothing resembling this pattern is found in the database.

- But we could try scanning PRINTS (pscan) and PROSITE (patmatmotifs) with one of our sequences.

Some Programs

Some Programs …

More Information