Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on...

Page 1: Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution.

Woods Hole, Massachusetts

July 25, 2005, 7 to 10 PM

Marine Biological Laboratory

Workshop on Molecular Evolution

Page 2:

More data yields stronger analyses, if done carefully!

Mosaic ideas and evolutionary ‘importance.’

Multiple Sequence Alignment & Analysis thru GCG’s SeqLab

Steven M. Thompson

Florida State University School of Computational Science (SCS)

Page 3:

But first a prelude: My definitions

Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

Page 4:

And a ‘way’ to think about it: The reverse biochemistry analogy

From a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.

Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.

The computer and molecular databases are an essential part of this process.

Page 5:

The exponential growth of molecular sequence databases & CPU power

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Doubling time ~ 1 year!
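As a rough check on the doubling-time figure, the following sketch estimates an average doubling time from the first and last rows of the GenBank table, assuming steady exponential growth between those two years. The function name `doubling_time` is illustrative only and is not part of any package mentioned in this talk.

```python
import math

def doubling_time(start_year, start_value, end_year, end_value):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(end_value / start_value)
    return (end_year - start_year) / doublings

# Base pairs in GenBank, first and last rows of the table (1982 and 2004).
bp = doubling_time(1982, 680338, 2004, 44575745176)
# Sequence counts for the same two years.
seqs = doubling_time(1982, 606, 2004, 40604319)

print(f"base pairs double about every {bp:.2f} years")
print(f"sequence counts double about every {seqs:.2f} years")
```

Averaged over the whole 22 years this comes out somewhat above one year per doubling; individual stretches of the table (1999 to 2000, for instance) grow considerably faster than the average.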

Page 6:

Back to multiple sequence alignment: Applicability?

So what; why even bother?

Applications:

Probe/primer, and motif/profile design;

Graphical illustrations;

Comparative ‘homology’ inference;

Molecular evolutionary analysis.

OK, well, how do you do it?

Page 7:

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . .

complexity = [sequence length]^(number of sequences)
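To make the exponent concrete, here is a small sketch that counts the cells in a full N-dimensional dynamic programming matrix, assuming every sequence has the same length. The helper name `dp_cells` and the length of 300 residues are arbitrary choices for illustration.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Cells in a full N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences

# Each added sequence multiplies the work by the sequence length.
for n in (2, 3, 5, 10):
    print(n, "sequences of length 300:", dp_cells(300, n), "cells")
```

Two sequences of 300 residues need 90,000 cells, which is easy; ten of them need more cells than any computer could ever fill, which is why heuristics are required.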

Page 8:

‘Global’ heuristic solutions

See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, http://searchlauncher.bcm.tmc.edu/, but these come with severely limiting restrictions!

Page 9:

Multiple Sequence Dynamic Programming

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
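The first stage of the progressive approach, comparing all sequences pairwise and ranking the pairs by similarity, can be sketched as follows. This is a minimal illustration and not the method of any particular program: it assumes a plain match/mismatch/linear-gap scoring scheme rather than a substitution matrix, the toy sequences are invented, and the names `nw_score` and `joining_order` are hypothetical. The later stage (aligning groups to groups) is not shown.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score of two sequences."""
    # prev holds the previous row of the two-dimensional scoring matrix.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def joining_order(seqs):
    """All sequence pairs, sorted from most to least similar."""
    scored = [(nw_score(seqs[x], seqs[y]), x, y)
              for x, y in combinations(sorted(seqs), 2)]
    return sorted(scored, key=lambda t: -t[0])

# Invented toy data, for illustration only.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTCCGA"}
for score, x, y in joining_order(toy):
    print(x, y, score)
```

The pair at the top of the list is the one a progressive method would align first; each pairwise comparison only ever fills a two-dimensional matrix, which is what keeps the problem tractable.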

Page 10:

Reliability and the Comparative Approach:

explicit homologous correspondence;

manual adjustments based on knowledge, especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and the Ribosomal Database Project:

http://rdp.cme.msu.edu/index.jsp

Page 11:

Structural & Functional correspondence in the Wisconsin Package’s SeqLab

Page 12:

Work with proteins! If at all possible:

Twenty match symbols versus four, plus similarity! Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align.

Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
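The "translate, then align" step can be sketched in a few lines. This is a minimal example assuming the standard genetic code, reading frame 1, and a coding sequence with no introns; the names `CODON_TABLE` and `translate` are illustrative and not taken from any package named in these slides.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a coding sequence in frame 1; unknown codons become X."""
    dna = dna.upper().replace("U", "T")
    # Ignore any trailing one or two bases that do not form a full codon.
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

print(translate("ATGGCCAAATAA"))  # prints MAK*
```

Aligning the translated peptides and then mapping the gaps back onto the DNA, three bases per gap, is what keeps indels from landing inside codons.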

Page 13:

Beware of aligning apples and oranges [and grapefruit]!

Paralogous versus orthologous;

genomic versus cDNA;

mature versus precursor.

Page 14:

Mask out uncertain areas

Page 15:

Complications:

Order dependence. Not that big of a deal.

Substitution matrices and gap penalties. A very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

Page 16:

Complications cont.:

Format hassles!

Specialized format conversion tools such as GCG’s From… and To… programs and PAUPSearch.

Don Gilbert’s public domain ReadSeq program.

Page 17:

Still more complications:

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:

., -, ~, ?, N, or X

. . . . . Help!
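One way to tame the symbol discrepancy before moving an alignment between programs is to rewrite everything to a single convention. The sketch below is only one possible convention, chosen as an assumption for illustration: it treats ".", "~", and "-" as alignment gaps and "?" as missing data, and it deliberately leaves N and X alone because they are also legitimate ambiguity codes. Which symbol a given program expects must be checked in that program's documentation; the name `normalize_gaps` is hypothetical.

```python
GAP_SYMBOLS = ".~-"      # assumed here to all mean "alignment gap"
MISSING_SYMBOLS = "?"    # assumed here to mean "no data"

def normalize_gaps(aligned, gap="-", missing="?"):
    """Rewrite gap and missing-data characters to one symbol each."""
    table = {ord(c): gap for c in GAP_SYMBOLS}
    table.update({ord(c): missing for c in MISSING_SYMBOLS})
    return aligned.translate(table)

print(normalize_gaps("~~ACG..T-A?N"))  # prints --ACG--T-A?N
```

Passing a different `gap` or `missing` argument converts in the other direction, for example `normalize_gaps(seq, gap="~")` for a program that wants tildes.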

Page 18:

Web resources for pairwise, progressive multiple alignment:

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html

http://pbil.univ-lyon1.fr/alignment.html

http://www.ebi.ac.uk/clustalw/

http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

Page 19:

If large datasets become intractable for analysis on the Web, what other resources are available?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,

commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Page 20:

Therefore, UNIX server-based solutions

Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so, commercial products, e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] and the SeqLab Graphical User Interface, simplify matters for administrators and users.

One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp, and scp/sftp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

Page 21:

The Genetics Computer Group: The Accelrys Wisconsin Package for Sequence Analysis

Begun in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia Inc. U.S.A., Accelrys Division, under the brand new name, as of May 2005, Discovery Studio GCG.

The suite contains almost 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to sophisticated results.

Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world by more than 30,000 scientists at over 950 institutions in more than 35 countries, so learning it here will likely be useful at any other research institution that you may end up at.

Page 22:

To answer the always perplexing GCG question, “What sequence(s)? . . . .”

Specifying sequences, GCG style; in order of increasing power and complexity:

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, one can supply attribute information within list files to specify something special about the sequence.
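The four kinds of specification can be told apart by their punctuation alone, which the following sketch illustrates. It is not GCG code and does not reproduce how GCG itself parses its command line; `classify_spec` and the file names `example.seq` and `@my.list` are hypothetical, chosen only to exercise each rule described on this slide.

```python
def classify_spec(spec):
    """Guess which of the four kinds of GCG sequence specification this is."""
    spec = spec.strip()
    if spec.startswith("@"):
        return "list file"                 # leading at sign
    if spec.endswith("}") and "{" in spec:
        return "multiple sequence file"    # file name followed by braces
    if ":" in spec:
        return "database entry"            # logical name, colon, identifier
    return "single sequence file"

for s in ("example.seq", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@my.list"):
    print(s, "->", classify_spec(s))
```

The order of the checks matters: a multiple sequence file given with a full path may itself contain a colon on some systems, so the braces are tested before the colon.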

Page 23:

    This is a small example of GCG single sequence format.
    Always put some documentation on top, so in the future
    you can figure out what it is you're dealing with! The
    line with the two periods is converted to the checksum line.

    example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

           1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
          51  GATTTAATAG CATGCGATCC CATGGGA

‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs)

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
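Because everything before the two periods is documentation and everything after is sequence, a GCG single sequence file is easy to read with a short script. The sketch below is an illustration under simplifying assumptions, not a full reader: it assumes the documentation contains no ".." of its own, it keeps only letters and "*" from the sequence body (so gap symbols would be dropped), and `parse_gcg` is a hypothetical name.

```python
import re

GCG_TEXT = """\
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""

def parse_gcg(text):
    """Return (declared_length, declared_check, sequence) from GCG text."""
    # Everything up to the first ".." is documentation plus the checksum line.
    header, _, body = text.partition("..")
    length = int(re.search(r"Length:\s*(\d+)", header).group(1))
    check = int(re.search(r"Check:\s*(\d+)", header).group(1))
    # Position numbers and blanks are layout only; keep the residues.
    sequence = "".join(re.findall(r"[A-Za-z*]+", body))
    return length, check, sequence

length, check, sequence = parse_gcg(GCG_TEXT)
print(length, check, len(sequence))
```

Comparing the declared length with the number of residues actually read is a cheap sanity check after any file transfer or format conversion.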

Page 24:

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:

GENBANKPLUS     all of GenBank plus EST and GSS subdivisions
GBP             all of GenBank plus EST and GSS subdivisions
GENBANK         all of GenBank except EST and GSS subdivisions
GB              all of GenBank except EST and GSS subdivisions
BA              GenBank bacterial subdivision
BACTERIAL       GenBank bacterial subdivision
EST             GenBank EST (Expressed Sequence Tags) subdivision
GSS             GenBank GSS (Genome Survey Sequences) subdivision
HTC             GenBank High Throughput cDNA
HTG             GenBank High Throughput Genomic
IN              GenBank invertebrate subdivision
INVERTEBRATE    GenBank invertebrate subdivision
OM              GenBank other mammalian subdivision
OTHERMAMM       GenBank other mammalian subdivision
OV              GenBank other vertebrate subdivision
OTHERVERT       GenBank other vertebrate subdivision
PAT             GenBank patent subdivision
PATENT          GenBank patent subdivision
PH              GenBank phage subdivision
PHAGE           GenBank phage subdivision
PL              GenBank plant subdivision
PLANT           GenBank plant subdivision
PR              GenBank primate subdivision
PRIMATE         GenBank primate subdivision
RO              GenBank rodent subdivision
RODENT          GenBank rodent subdivision
STS             GenBank (sequence tagged sites) subdivision
SY              GenBank synthetic subdivision
SYNTHETIC       GenBank synthetic subdivision
TAGS            GenBank EST and GSS subdivisions
UN              GenBank unannotated subdivision
UNANNOTATED     GenBank unannotated subdivision
VI              GenBank viral subdivision
VIRAL           GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT         GenBank CDS translations
GP              GenBank CDS translations
SWISSPROTPLUS   all of Swiss-Prot and all of SPTrEMBL
SWP             all of Swiss-Prot and all of SPTrEMBL
SWISSPROT       all of Swiss-Prot (fully annotated)
SW              all of Swiss-Prot (fully annotated)
SPTREMBL        Swiss-Prot preliminary EMBL translations
SPT             Swiss-Prot preliminary EMBL translations
P               all of PIR Protein
PIR             all of PIR Protein
PROTEIN         PIR fully annotated subdivision
PIR1            PIR fully annotated subdivision
PIR2            PIR preliminary subdivision
PIR3            PIR unverified subdivision
PIR4            PIR unencoded subdivision
NRL_3D          PDB 3D protein sequences
NRL             PDB 3D protein sequences

General data files:

GENMOREDATA     path to GCG optional data files
GENRUNDATA      path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

Page 25:

GCG MSF & RSF format

The trick is to not forget the Braces and ‘wild card,’ e.g. filename{*}, when specifying!

    !!RICH_SEQUENCE 1.0
    ..
    {
    name ef1a_giala
    descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
    type PROTEIN
    longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
    sequence-ID Q08046
    checksum 7342
    offset 23
    creation-date 07/11/2001 16:51:19
    strand 1
    comments ////////////////////////////////////////////////////////////

This is SeqLab’s native format

    !!AA_MULTIPLE_ALIGNMENT 1.0

    small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

    Name: a49171  Len: 425  Check: 537   Weight: 1.00
    Name: e70827  Len: 577  Check: 21    Weight: 1.00
    Name: g83052  Len: 718  Check: 9535  Weight: 1.00
    Name: f70556  Len: 534  Check: 3494  Weight: 1.00
    Name: t17237  Len: 229  Check: 9552  Weight: 1.00
    Name: s65758  Len: 735  Check: 111   Weight: 1.00
    Name: a46241  Len: 274  Check: 3514  Weight: 1.00

    //
    //////////////////////////////////////////////////
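The "Name:" lines in an MSF header are regular enough to pull apart with one pattern, which is handy for checking what a file contains before specifying sequences from it. The following is a hedged sketch rather than a complete MSF reader: it only looks at the header above the "//" divider, and `NAME_LINE` and `msf_names` are hypothetical names. The sample uses the first two entries shown on this slide.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def msf_names(text):
    """Collect (name, length, check, weight) from the header of an MSF file."""
    # The "//" line separates the header from the aligned sequence blocks.
    header = text.split("//", 1)[0]
    return [(m[1], int(m[2]), int(m[3]), float(m[4]))
            for m in NAME_LINE.finditer(header)]

sample = """\
 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
//
"""
for entry in msf_names(sample):
    print(entry)
```

The names recovered this way are exactly what goes inside the braces, for example `small.pfs.msf{a49171}` for one entry or `small.pfs.msf{*}` for all of them.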

Page 26:

The List File Format

    An example GCG list file of many elongation
    1a and Tu factors follows. As with all GCG
    data files, two periods separate
    documentation from data. ..

    my-special.pep  begin:24  end:134

    SwissProt:EfTu_Ecoli

    Ef1a-Tu.msf{*}

    /usr/accounts/test/another.rsf{ef1a_*}

@[email protected]

The ‘way’ SeqLab works!

remember the @ sign!
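A list file can be read with the same "documentation, two periods, data" rule as every other GCG file. The sketch below is an illustration under assumptions rather than GCG's own parser: it treats the first whitespace-separated field on each data line as the sequence specification and any following `name:value` fields as attributes, it does not follow nested list files, and `read_list_file` is a hypothetical name.

```python
def read_list_file(text):
    """Return [(specification, attributes)] for entries after the '..' line."""
    _, found, data = text.partition("..")
    if not found:
        raise ValueError("no '..' divider between documentation and data")
    entries = []
    for line in data.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines between entries
        # Attributes such as begin:24 follow the specification itself.
        attrs = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        entries.append((fields[0], attrs))
    return entries

sample = """\
An example list file. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
"""
for spec, attrs in read_list_file(sample):
    print(spec, attrs)
```

Only the fields after the first one are treated as attributes, so the colon inside a database specification such as `SwissProt:EfTu_Ecoli` is left untouched.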

Page 27:

SeqLab: GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X-Windowing environment, either native on UNIX computers (including Linux; it is not installed by default on Mac OS X [v.10+] systems, but see Apple’s free X11 package or XDarwin), or emulated with X-Server Software on personal computers.

Page 28:

FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.

Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.

Conclusions:

Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Page 29: Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution.

Many texts are now available in the field. To ‘honk-my-own-horn’ a bit, check out:

Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html);

and Horizon Scientific Press’ Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html).

AND FOR EVEN MORE INFO...

Humana Press’ Introduction to Bioinformatics: A Theoretical and Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0).

They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.

Page 30: Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution.

References

Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A., pp. 28–36.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.

Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.

Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.

Felsenstein, J. (1993–2005) PHYLIP (Phylogeny Inference Package). Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.

Genetics Computer Group (Copyright 1982–2005) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.

Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.

Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.

Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989–1993) Illinois Natural History Survey, (1994) personal copyright, (1995–2000) Smithsonian Institution, Washington D.C., U.S.A., and (2001–2005) Florida State University, School of Computational Science, Tallahassee, Florida, U.S.A.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.

Page 31: Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution.

On to a demonstration of some of SeqLab’s multiple sequence dataset capabilities:

Glutathione Reductase, G-protein coupled TM7 receptors, primate prions, Human Papilloma Virus L1 major coat protein, Major Histocompatibility Class II, Vicilin seed storage proteins, and Elongation Factor 1α/Tu.