Bio Perl


description

Bioperl programming tutorial for bioinformatics


Bioperl course [http://www.pasteur.fr/recherche/unites/sis/formation/bioperl] by Catherine Letondal and Katja Schuerer

Introduction to Bioperl (bioperl.org [http://bioperl.org]). This course introduces the bioperl modules with examples and exercises. The course content has been upgraded to bioperl 1.0 (differences with bioperl version 0.7 are displayed in yellow in the diagrams). Apart from this upgrade, the presentation has been reorganized by topic rather than by daily schedule, without any major changes in the content. A PDF [support.pdf] version is now available.

Table of Contents

1. General introduction
    1.1. Bioperl documentation
    1.2. General bioperl classes
2. Sequences
    2.1. The Bio::SeqIO class
        2.1.1. Format Converter
    2.2. Sequence classes
        2.2.1. Introduction
        2.2.2. Building mechanisms summary
        2.2.3. A deeper insight into the Bio::Seq class
        2.2.4. Bioperl 'Sequence' classes structure
    2.3. Features and Location classes
        2.3.1. Feature
        2.3.2. Code reading: extracting CDS
        2.3.3. Tag system
        2.3.4. Location
        2.3.5. Graphical view of features
    2.4. Sequence analysis tools
3. Alignments
    3.1. AlignIO
    3.2. SimpleAlign
    3.3. Code reading: protal2dna
4. Analysis
    4.1. Blast
        4.1.1. Running Blast
        4.1.2. Parsing Blast
        4.1.3. Bio::Tools::BPlite family parsers
        4.1.4. PSI-BLAST (Position Specific Iterative Blast)
        4.1.5. bl2seq: Blast 2 sequences
        4.1.6. Blast Internal classes structure
    4.2. Genscan
5. Databases
    5.1. Database classes
    5.2. Accessing a local database with golden
6. Perl Reminders
    6.1. UML
    6.2. Perl reminders to use bioperl modules
        6.2.1. References
        6.2.2. Filehandles and streams
        6.2.3. Exceptions
        6.2.4. Getopt::Std
        6.2.5. Classes
        6.2.6. BEGIN block
    6.3. Perl reminders for a further advanced understanding of bioperl modules
        6.3.1. Modules
        6.3.2. Compiler instructions
        6.3.3. Tie
A. Solutions
    A.1. Sequences
    A.2. Alignments
    A.3. Analysis
    A.4. Databases

List of Figures

2.1. Bio::Seq class structure
2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3. Bio::Annotation package structure
2.4. Bioperl 'Sequence' classes structure
2.5. Features Classes structure
2.6. Correspondence between an EMBL entry and bioperl tags
2.7. Location Classes Structure
2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
3.1. AlignIO Classes diagram
3.2. Align Classes diagram
4.1. Blast Classes diagram
4.2. BPLite Classes diagram
4.3. Blast internal classes diagram
4.4. Genscan Classes Structure
5.1. Database Classes structure
6.1. UML meanings
A.1. Bio::SeqIO structure

List of Examples

2.1. SwissProt -> Fasta
2.2. Loading a sequence from a remote server
2.3. Find the references to the PDB database entries
3.1. Format conversions with AlignIO
3.2. Basic methods of SimpleAlign
3.3. Filter gap columns
4.1. StandAloneBlast run
4.2. StandAloneBlast parsing
4.3. Parsing from a Blast file
4.4. Parsing with BPLite
4.5. Running PSI-blast
4.6. PSI-Blast SearchIO class
4.7. Running bl2seq
4.8. Genscan parsing
4.9. Genscan parsing with sub-sequences
5.1. Database class use
5.2. Database Index creation

List of Exercises

2.1. Bio::SeqIO
2.2. A universal converter
2.3. Display a sequence in fasta format
2.4. More on annotations
2.5. Transmembrane helices
2.6. extractcds
2.7. extractcds, version 2
3.1. Create an alignment without gaps
4.1. Running Blast on a Swissprot entry
4.2. Running Blast: Setting parameters
4.3. Running Blast: Saving output
4.4. Running a Remote Blast
4.5. Display Blast hits
4.6. Class of a Blast report
4.7. Parse a Blast output file
4.8. Parse Blast results on standard input
4.9. Filtering hits by length
4.10. Filtering hits by position
4.11. Display the best hit by databank
4.12. Multiple queries
4.13. Extracting the subject sequence
4.14. Extracting alignments
4.15. Locate EST in a genome
4.16. Record Blast hits as sequence features
4.17. Print information from a BPLite report
4.18. Create a Bio::Tools::BPlite from a file
4.19. Create a Bio::Tools::BPlite from standard input
4.20. Parse a PSI-blast report
4.21. Build a Bio::SimpleAlign object
4.22. Code reading: Bio::Tools::Genscan module
5.1. Parse a Genscan report and build a database entry
5.2. Parse a Genscan report and build a database entry with the genomic sequence
5.3. Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1. Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. bioperl-live [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2. General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1. The Bio::SeqIO class

2.1.1. Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

    • read the sequence from the input stream

    • write it to the output stream

in bioperl:

    #!/local/bin/perl -w

    use strict;
    use Bio::SeqIO;

    # '<' opens seqs.sp for reading, '>' opens seqs.fasta for writing
    my $in  = Bio::SeqIO->newFh( -file => '<seqs.sp',    -format => 'swiss' );
    my $out = Bio::SeqIO->newFh( -file => '>seqs.fasta', -format => 'fasta' );

    # each read returns one sequence object; printing it writes it as fasta
    print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2. Sequence classes

2.2.1. Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq      = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2. Building mechanisms summary

de novo:

    use Bio::Seq;

    my $seq1 = Bio::Seq->new( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                              -id   => 'my seq',
                              -desc => 'this is a new Seq' );

from a file:

    use Bio::SeqIO;

    # object interface
    my $seqin = Bio::SeqIO->new( -file => 'seq.fasta', -format => 'fasta' );
    my $seq3  = $seqin->next_seq();

    # filehandle interface
    my $seqin2 = Bio::SeqIO->newFh( -file => 'seq.fasta', -format => 'fasta' );
    my $seq4   = <$seqin2>;

    # reading from the output of a command (note the trailing pipe)
    my $seqin3 = Bio::SeqIO->newFh( -file   => 'golden sp:TAUD_ECOLI |',
                                    -format => 'swiss' );
    my $seq5   = <$seqin3>;

by fetching an entry from a remote database:

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6   = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

    use Bio::Index::Swissprot;

    my $inx  = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx' );
    my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3. A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure and, if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

    # $seq is a sequence object such as the one loaded in Example 2.2
    my $annotation = $seq->annotation();

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ( $link->database() eq 'PDB' ) {
            push( @structures, $link->primary_id() );
        }
    }

    print "\nPDB Structures: ", join( " ", @structures ), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to:

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq1.0.pl] solution of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

    • Bio::SeqFeature::Gene::ExonI

    • Bio::SeqFeature::Gene::Exon

    • Bio::SeqFeature::Gene::Transcript

    • Bio::SeqFeature::Gene::TranscriptI

    • Bio::SeqFeature::Gene::GeneStructureI

    • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS    1..1000
           /codon=(seq:"cug",aa:Ser)
           /codon=(seq:"tga",aa:Trp)
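To make the primary tag / secondary tag vocabulary concrete, here is a minimal sketch in plain Perl. It deliberately does not use bioperl: the helper name parse_feature and the text layout it expects are assumptions made for this illustration only. It splits a feature-table block into its feature key (primary tag), its location, and a hash of qualifier values (secondary tags).

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl function): split a feature-table block
# into its key (primary tag), its location string, and a hash mapping each
# qualifier name (secondary tag) to the list of its values.
sub parse_feature {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my ( $key, $location ) = split ' ', shift @lines;
    my %tags;
    foreach my $line (@lines) {
        # a qualifier line looks like:  /name=value
        if ( $line =~ m{^\s*/(\w+)=(.*?)\s*$} ) {
            push @{ $tags{$1} }, $2;
        }
    }
    return ( $key, $location, \%tags );
}

my $feature = <<'END';
CDS    1..1000
       /codon=(seq:"cug",aa:Ser)
       /codon=(seq:"tga",aa:Trp)
END

my ( $primary, $location, $tags ) = parse_feature($feature);
print "primary tag: $primary ($location)\n";
foreach my $tag ( sort keys %$tags ) {
    print "  secondary tag: $tag = $_\n" for @{ $tags->{$tag} };
}
```

A secondary tag can carry several values (two codon qualifiers here), which is why each tag maps to a list rather than to a single string.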

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy( -start    => '<30',
                                                  -end      => 90,
                                                  -loc_type => '.' );

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive

• '<' BEFORE

• '>' AFTER
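The coding above can be illustrated with a short plain-Perl sketch that names the coding used by a single start or end position string. This is only an assumption-laden illustration of the notation: position_type is a hypothetical helper, and it is not how Bio::Location::Fuzzy is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): name the coding used by one
# position string written in feature-table notation.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;              # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;              # >177
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;          # 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;    # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;               # 90
    return 'UNKNOWN';
}

foreach my $pos ( '<30', '90', '123^124', '(102.110)', '>177' ) {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

With the fuzzy location of the example, the start '<30' is a BEFORE position and the end 90 is an EXACT position.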

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;                # sequence as string
    $seqobj->subseq(10,40);      # sub-sequence as string
    $seqobj->trunc(10,100);      # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
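The coordinates given to subseq and trunc are 1-based and inclusive, unlike the 0-based offsets of Perl's substr. The following plain-Perl sketch shows the conversion; plain_subseq is a hypothetical name used only for this illustration and is not a bioperl method.

```perl
use strict;
use warnings;

# Plain-Perl equivalent of a 1-based, inclusive sub-sequence extraction.
# plain_subseq is a hypothetical helper, not a bioperl method.
sub plain_subseq {
    my ( $seq, $start, $end ) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    return substr( $seq, $start - 1, $end - $start + 1 );
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print plain_subseq( $dna, 1, 3 ), "\n";    # first codon
print plain_subseq( $dna, 4, 9 ), "\n";    # the next two codons
```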

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new( -seq => $seqobj );
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new( -seq => $seqobj );
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat     = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern( -SEQ => $pat, -TYPE => 'Dna' );
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRI' );
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                              -MAKE => 'custom' );
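In a custom enzyme name such as 'EcoRV--GAT^ATC', the caret marks the cleavage position inside the recognition site. The plain-Perl sketch below shows what cutting at such a site means. It is independent of bioperl: cut_at_site is a hypothetical helper, and it only handles a linear sequence and an unambiguous site on the given strand.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl method): cut a linear sequence at every
# occurrence of a recognition site written with a caret at the cleavage
# position, e.g. 'GAT^ATC'.
sub cut_at_site {
    my ( $seq, $site ) = @_;
    my $offset = index( $site, '^' );        # number of bases left of the cut
    die "no cut position in '$site'\n" if $offset < 0;
    ( my $pattern = $site ) =~ s/\^//;       # recognition site without caret

    my @fragments;
    my $from = 0;                            # start of the current fragment
    my $pos  = 0;                            # where to resume the search
    while ( ( my $hit = index( $seq, $pattern, $pos ) ) >= 0 ) {
        my $cut = $hit + $offset;
        push @fragments, substr( $seq, $from, $cut - $from );
        $from = $cut;
        $pos  = $hit + 1;
    }
    push @fragments, substr( $seq, $from );  # remainder after the last cut
    return @fragments;
}

my @fragments = cut_at_site( 'AAGATATCTTGGGATATCCC', 'GAT^ATC' );
print join( ' | ', @fragments ), "\n";
```

Joining the fragments back together always gives the original sequence, which is a convenient sanity check for this kind of code.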

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave(
        '-file'      => 'sigtest.aa',    # not a Bio::Seq object
        '-threshold' => '3.5',
        '-desc'      => 'test sigcleave protein seq',
        '-type'      => 'AMINO' );

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    # input and output formats are optional command line arguments
    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ( $aln->is_flush() ) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split( "", $seq->seq() ) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap
    my $gapchar     = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings
    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq( shift @seq_strs );
    }

    print $out $aln;

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12
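As a reminder of what ref() returns, here is a small plain-Perl sketch. The class My::Report is hypothetical and merely stands in for whatever object a parser hands back; it is not a bioperl class and does not answer the exercise.

```perl
use strict;
use warnings;

# A hypothetical class, standing in for whatever object a parser returns.
package My::Report;
sub new { my ($class) = @_; return bless {}, $class; }

package main;

my $report = My::Report->new();
my @list   = ( 1, 2, 3 );

print ref($report), "\n";     # class name of a blessed object
print ref( \@list ), "\n";    # 'ARRAY' for a plain array reference
print "is a My::Report\n" if $report->isa('My::Report');
```

ref() gives the exact class of the object, while isa() also answers true for parent classes, which is usually the better test when a class hierarchy is involved.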

4.1.2.2. Parsing from a file

The $blast_report that is created by the Bio::Tools::Run::StandAloneBlast new method actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
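The hash idiom in this hint can be sketched in plain Perl as follows. The hit names are made up for the illustration, and the 'db|accession|name' layout is an assumption; a real report would supply the names through the hit objects.

```perl
use strict;
use warnings;

# Made-up hit names, listed from best to worst as in a Blast report.
# The 'db|accession|name' layout is an assumption made for this sketch.
my @hit_names = (
    'sp|ACC001|HIT_A',
    'sp|ACC002|HIT_B',
    'pir|ACC003|HIT_C',
    'gb|ACC004|HIT_D',
    'pir|ACC005|HIT_E',
);

my %done;    # databases whose best hit has already been reported
my @best;
foreach my $name (@hit_names) {
    my ($database) = split /\|/, $name;
    next if $done{$database};    # a better hit from this database was seen
    $done{$database} = 1;
    push @best, $name;
    print "best hit for $database: $name\n";
}
```

Because hits arrive sorted by significance, the first hit seen for a database is its best one, so a simple seen-flag is enough and no scores need to be compared.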

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'  => 'blastp',
                                                         'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);

    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query    = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'database' => 'swissprot' );
    $factory->j(2);    # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while ( my $hit = $result->nextSbjct() ) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ( $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new( -fh => \*FH );

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query1    = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' );
    my $query2    = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );
    my $report  = $factory->bl2seq( $query1, $query2 );

    # build one two-sequence alignment per HSP
    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ( $gene->exons() ) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ( $gene->introns() ) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ( $gene->utrs() ) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                         -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • with the exons of this CDS stored as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for reaching a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );   # Bio::Seq.pm objects

  which is equivalent to the following, with a named array variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );     # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
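As a minimal sketch of the kinds of references listed above, the following bioperl-free script passes a hash reference, an array reference and a function reference to a function. The names describe_params and my_filter, and the sample strings, are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: receives a reference to a hash of named parameters.
sub describe_params {
    my ($params) = @_;
    my $seqs   = $params->{-seqs};        # array reference
    my $filter = $params->{-filt_func};   # code reference
    my @kept   = grep { $filter->($_) } @$seqs;
    return join(",", @kept);
}

# Hypothetical filter: keep strings longer than 3 characters.
sub my_filter { return length($_[0]) > 3 }

my @seqs = ("ACGT", "AC", "ACGTAC");
my %runParam = (
    -seqs      => \@seqs,        # reference to a named array
    -filt_func => \&my_filter,   # reference to a function
);

my $kept = describe_params(\%runParam);   # reference to a named hash
print "kept: $kept\n";
print "types: ", ref(\%runParam), " ", ref(\@seqs), " ", ref(\&my_filter), "\n";
```

The last line shows what ref() reports for each kind of reference (HASH, ARRAY and CODE).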

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read one line
    print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

  Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
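To illustrate the typeglob reference without bioperl, here is a small sketch that passes a bareword filehandle to a function. The helper write_line is hypothetical, and the script writes to an in-memory file (this assumes Perl 5.8 or later) so that it has no side effects.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it receives.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# Open an in-memory "file" so that nothing is written to disk.
my $buffer = "";
open(OUTFILE, ">", \$buffer) || die "cannot open in-memory file: $!";

# OUTFILE is a bareword: to pass it as a parameter,
# we pass a reference to its typeglob.
write_line(\*OUTFILE, "first line");
write_line(\*OUTFILE, "second line");
close(OUTFILE);

print $buffer;
```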

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }
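The same eval/die pattern can be tried without bioperl. The sketch below uses two hypothetical functions, checked_length (which raises an exception) and safe_length (which catches it); the error message only imitates the bioperl one.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on invalid input.
sub checked_length {
    my ($seq) = @_;
    die "sequence [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return length($seq);
}

# Returns the length, or -1 when an exception was caught.
sub safe_length {
    my ($seq) = @_;
    my $len = eval {
        checked_length($seq);
    };                        # the semicolon ends the eval statement
    if ($@) {
        print "Caught exception: $@";
        return -1;
    }
    return $len;
}

my $ok  = safe_length("ACGT");
my $bad = safe_length("==");
print "ok=$ok bad=$bad\n";
```

After the eval block, $@ holds the message given to die, or the empty string when no exception was raised.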

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
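A minimal sketch of this abstract-method idiom, assuming three hypothetical classes (My::SeqI as the interface, My::GoodSeq which implements seq, and My::BadSeq which does not), could look like this:

```perl
use strict;
use warnings;

package My::SeqI;             # hypothetical interface class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide a seq method\n";
}

package My::GoodSeq;          # implements seq
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package My::BadSeq;           # forgets to implement seq
our @ISA = ('My::SeqI');

package main;

my $good  = My::GoodSeq->new(seq => "ACGT")->seq;
my $bad   = eval { My::BadSeq->new(seq => "ACGT")->seq };
my $error = $@;
print "good: $good\n";
print "error: $error";
```

Calling seq on a My::BadSeq object falls back to the interface's version, which raises the exception.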

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
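To see what getopts does with a command line, the following sketch fills @ARGV by hand (the argument values are made up for the illustration) and then applies the first example above:

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line; in a real script @ARGV is filled by the shell.
@ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra_argument');

my %opts = ();
getopts('i:o:I:O:', \%opts);    # a colon means "this option takes a value"

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

# getopts removes the options it has recognized from @ARGV.
print "$inform -> $outform, $infile -> $outfile, remaining: @ARGV\n";
```

Options that are absent from the command line fall back to the defaults given after ||.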

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
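These class-related calls can be tried on a toy hierarchy. In the sketch below, My::Root and My::Seq are hypothetical classes standing in for the bioperl ones:

```perl
use strict;
use warnings;

package My::Root;                        # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package My::Seq;                         # hypothetical subclass
our @ISA = qw(My::Root);                 # inheritance

package main;

my $seq = My::Seq->new();                # new is inherited from My::Root

my $class_isa  = My::Seq->isa("My::Root")  ? 1 : 0;   # test on a class
my $object_isa = $seq->isa("My::Root")     ? 1 : 0;   # test on an object
my $not_isa    = $seq->isa("My::Other")    ? 1 : 0;   # unrelated class

print "class of \$seq: ", ref($seq), "\n";
print "isa tests: $class_isa $object_isa $not_isa\n";
```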

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

  imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
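The three Exporter variables can be sketched in a single file. My::Tools, revcomp, gc_count and $Default are hypothetical names; the module is declared inside a BEGIN block and registered in %INC only so that the use statement works without a separate My/Tools.pm file.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;                   # hypothetical module
    use Exporter;
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(revcomp);                   # exported by default
    our @EXPORT_OK   = qw(gc_count $Default);         # imported on demand
    our %EXPORT_TAGS = ( obj => [ qw($Default) ] );   # a named set of symbols
    our $Default     = "ACGT";

    sub revcomp {
        my $s = scalar reverse $_[0];
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }
    sub gc_count {
        my ($s) = @_;
        return ($s =~ tr/GCgc//);        # tr returns the number of matches
    }

    $INC{'My/Tools.pm'} = __FILE__;      # pretend the module file was loaded
}

# Default symbols, plus one symbol on demand, plus one tagged set.
use My::Tools qw(:DEFAULT gc_count :obj);

my $rc = revcomp("AAGC");
my $gc = gc_count("AAGC");
print "revcomp=$rc gc=$gc default=$Default\n";
```

Note that as soon as an import list is given, the default symbols are only imported if :DEFAULT is listed explicitly.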

6.3.2 Compiler instructions

• use strict;: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
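The tie mechanism can be observed on a small stand-alone class. In the sketch below, My::LineIO is a hypothetical class playing the role that Bio::SeqIO::Fh plays above; it serves items from a list instead of sequence objects from a file.

```perl
use strict;
use warnings;
use Symbol;

package My::LineIO;       # hypothetical class standing in for Bio::SeqIO::Fh

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

# Called by <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @rest = @{ $self->{in} };
    $self->{in} = [];
    return @rest;
}

# Called by print $fh ...: here the arguments are simply stored.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym;                              # anonymous glob
my $obj = tie *$fh, 'My::LineIO', "seq1", "seq2", "seq3";

my $first = <$fh>;            # calls READLINE in scalar context
my @rest  = <$fh>;            # calls READLINE in list context
print $fh "written";          # calls PRINT

print "first=$first rest=@rest stored=@{ $obj->{out} }\n";
```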

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split(/\n/, $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
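The column arithmetic of this solution can be checked on plain strings, without bioperl. In the sketch below, trim_columns is a hypothetical helper and the three aligned rows are made-up data; substr stands in for the slice method of the alignment object.

```perl
use strict;
use warnings;

# Hypothetical helper: given aligned strings of equal length, return the
# 1-based first and last columns to keep once leading and trailing gap
# blocks are removed.
sub trim_columns {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ("--ACGT-A--",
            "-TACGTTA-C",
            "GTACG-TAC-");
my ($first, $last) = trim_columns("-", @rows);
my @trimmed = map { substr($_, $first - 1, $last - $first + 1) } @rows;

print "columns $first..$last\n";
print "$_\n" for @trimmed;
```

Internal gaps are kept: only the columns before the longest leading gap run and after the longest trailing gap run are cut.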

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
        foreach my $rid (@rids) {

            my $rc = $factory->retrieve_blast($rid);
            if (!ref($rc)) {

                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;

            } else {

                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
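The "first hit seen per databank" logic relies only on the hit names, so it can be sketched without a Blast report. The hit names below are made up, and are assumed to be sorted from best to worst as they would be in a report.

```perl
use strict;
use warnings;

# Extract the databank tag, i.e. the word before the first "|".
sub dbname {
    my ($name) = @_;
    return $name =~ /(\w+)\|/ ? $1 : $name;
}

# Hypothetical hit names, best hit first.
my @hits = ("sp|ID1|NAME_A", "pir|ID2|NAME_B", "sp|ID3|NAME_C",
            "gb|ID4|NAME_D", "pir|ID5|NAME_E");

my %done;
my @best;
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db};      # a better hit from this databank was already seen
    $done{$db} = 1;
    push @best, $hit;
}

print "$_\n" for @best;
```

Because hits are visited in report order, the first hit recorded for a databank is also its best one.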

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                                  -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
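The padding done by fill_gaps can be checked in isolation. The sketch below is a shorter variant written with the x repetition operator rather than loops (an assumption of equivalence with the loops above, for a sequence that fits within the given length); the input values are made up.

```perl
use strict;
use warnings;

# Variant of fill_gaps: pad $seq with gaps so that it starts at column
# $start (1-based) and the result is $length characters long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;            # sequence runs past the end: no padding
    return ("-" x $lead) . $seq . ("-" x $trail);
}

my $gapped = fill_gaps("ACG", 3, 8);
print "$gapped\n";
print length($gapped), "\n";
```

With a sequence of 3 residues placed at column 3 of an alignment of length 8, two leading and three trailing gaps are added.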

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @$oldhitarray_ref);
        push(@previous_hitarray, @$hitarray_ref);
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

  You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1 );

    $inx->make_index(@ARGV);

  You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#!/usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#!/local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
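The windowing step of this solution (the exon span widened by a delta on each side, then kept inside the submitted sequence) can be sketched in plain Perl, without bioperl. The helper name clamp_window is hypothetical. The sketch assumes 1-based inclusive coordinates, so it clamps the start to 1, whereas the script above clamps it to 0.

```perl
use strict;
use warnings;

# Hypothetical helper: widen [first, last] by $delta on each side and
# clamp the result to [1, $seq_length] (1-based, inclusive coordinates).
sub clamp_window {
    my ($first, $last, $delta, $seq_length) = @_;
    my $start = $first - $delta;
    my $end   = $last + $delta;
    $start = 1           if $start < 1;
    $end   = $seq_length if $end > $seq_length;
    return ($start, $end);
}

my ($start, $end) = clamp_window(50, 980, 100, 1000);
print "window: $start..$end\n";
```

With a window computed this way, the exon coordinates of the new entry are the original ones shifted by the window start, which is what the loop over exons does.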


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # an entry is written "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;
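The entry handling of this module (split a "database:identifier" string, check that the database is known, look up its format) can be tried without golden or bioperl. The following is a minimal sketch in plain Perl. The names %FORMAT_OF and parse_entry are hypothetical, and errors are raised with die rather than with the throw method of a bioperl object.

```perl
use strict;
use warnings;

# Database name => sequence format, mirroring %AVAIL_LOCALDB above.
my %FORMAT_OF = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Hypothetical helper: split "db:id" and look up the format for "db".
# Dies when the entry has no identifier part or the database is unknown.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $FORMAT_OF{$db};
    return ($db, $id, $FORMAT_OF{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";
```

Checking the entry before starting the external program keeps the error messages specific: a malformed entry and an unknown database are reported separately from a missing entry.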



Bioperl course [http://www.pasteur.fr/recherche/unites/sis/formation/bioperl]
by Catherine Letondal and Katja Schuerer

Introduction to Bioperl (bioperl.org [http://bioperl.org]). This course introduces the bioperl modules with examples and exercises. The course content has been upgraded to bioperl 1.0 (differences with bioperl version 0.7 are displayed in yellow in the diagrams). Apart from this upgrade, the presentation has been reorganized by topic rather than by daily schedule, without any major changes in the content. A PDF [support.pdf] version is now available.

Table of Contents

1. General introduction
  1.1. Bioperl documentation
  1.2. General bioperl classes
2. Sequences
  2.1. The Bio::SeqIO class
    2.1.1. Format Converter
  2.2. Sequence classes
    2.2.1. Introduction
    2.2.2. Building mechanisms summary
    2.2.3. A deeper insight into the Bio::Seq class
    2.2.4. Bioperl 'Sequence' classes structure
  2.3. Features and Location classes
    2.3.1. Feature
    2.3.2. Code reading: extracting CDS
    2.3.3. Tag system
    2.3.4. Location
    2.3.5. Graphical view of features
  2.4. Sequence analysis tools
3. Alignments
  3.1. AlignIO
  3.2. SimpleAlign
  3.3. Code reading: protal2dna
4. Analysis
  4.1. Blast
    4.1.1. Running Blast
    4.1.2. Parsing Blast
    4.1.3. Bio::Tools::BPlite family parsers
    4.1.4. PSI-BLAST (Position Specific Iterative Blast)
    4.1.5. bl2seq: Blast 2 sequences
    4.1.6. Blast Internal classes structure
  4.2. Genscan
5. Databases
  5.1. Database classes
  5.2. Accessing a local database with golden
6. Perl Reminders
  6.1. UML
  6.2. Perl reminders to use bioperl modules
    6.2.1. References
    6.2.2. Filehandles and streams
    6.2.3. Exceptions
    6.2.4. Getopt::Std
    6.2.5. Classes
    6.2.6. BEGIN block
  6.3. Perl reminders for a further advanced understanding of bioperl modules
    6.3.1. Modules
    6.3.2. Compiler instructions
    6.3.3. Tie
A. Solutions
  A.1. Sequences
  A.2. Alignments
  A.3. Analysis
  A.4. Databases

List of Figures

2.1. Bio::Seq class structure
2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3. Bio::Annotation package structure
2.4. Bioperl 'Sequence' classes structure
2.5. Features Classes structure
2.6. Correspondence between an EMBL entry and bioperl tags
2.7. Location Classes Structure
2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
3.1. AlignIO Classes diagram
3.2. Align Classes diagram
4.1. Blast Classes diagram
4.2. BPLite Classes diagram
4.3. Blast internal classes diagram
4.4. Genscan Classes Structure
5.1. Database Classes structure
6.1. UML meanings
A.1. Bio::SeqIO structure

List of Examples

2.1. SwissProt -> Fasta
2.2. Loading a sequence from a remote server
2.3. Find the references to the PDB database entries
3.1. Format conversions with AlignIO
3.2. Basic methods of SimpleAlign
3.3. Filter gap columns
4.1. StandAloneBlast run
4.2. StandAloneBlast parsing
4.3. Parsing from a Blast file
4.4. Parsing with BPLite
4.5. Running PSI-blast
4.6. PSI-Blast: SearchIO class
4.7. Running bl2seq
4.8. Genscan parsing
4.9. Genscan parsing with sub-sequences
5.1. Database class use
5.2. Database Index creation

List of Exercises

2.1. Bio::SeqIO
2.2. A universal converter
2.3. Display a sequence in fasta format
2.4. More on annotations
2.5. Transmembrane helices
2.6. extractcds
2.7. extractcds version 2
3.1. Create an alignment without gaps
4.1. Running Blast on a Swissprot entry
4.2. Running Blast: Setting parameters
4.3. Running Blast: Saving output
4.4. Running a Remote Blast
4.5. Display Blast hits
4.6. Class of a Blast report
4.7. Parse a Blast output file
4.8. Parse Blast results on standard input
4.9. Filtering hits by length
4.10. Filtering hits by position
4.11. Display the best hit by databank
4.12. Multiple queries
4.13. Extracting the subject sequence
4.14. Extracting alignments
4.15. Locate EST in a genome
4.16. Record Blast hits as sequence features
4.17. Print information from a BPLite report
4.18. Create a Bio::Tools::BPlite from a file
4.19. Create a Bio::Tools::BPlite from standard input
4.20. Parse a PSI-blast report
4.21. Build a Bio::SimpleAlign object
4.22. Code reading: Bio::Tools::Genscan module
5.1. Parse a Genscan report and build a database entry
5.2. Parse a Genscan report and build a database entry with the genomic sequence
5.3. Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1. Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2. General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)


Chapter 2. Sequences

2.1. The Bio::SeqIO class

2.1.1. Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences

  • read the sequence from input stream

  • write it to output stream

in bioperl:

#!/local/bin/perl -w

use strict;

use Bio::SeqIO;

my $in  = Bio::SeqIO->newFh ( -file   => '<seqs.sp',
                              -format => 'swiss' );

my $out = Bio::SeqIO->newFh ( -file   => '>seqs.fasta',
                              -format => 'fasta' );

print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]


Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1)

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2)

2.2. Sequence classes

2.2.1. Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

use Bio::DB::SwissProt;

my $database = new Bio::DB::SwissProt;
my $seq = $database->get_Seq_by_id('BACR_HALHA');

print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";


Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3)

2.2.2. Building mechanisms summary

de novo:

use Bio::Seq;

my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                           -id   => 'my seq',
                           -desc => 'this is a new Seq');

from a file:

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new ( -file   => 'seq.fasta',
                              -format => 'fasta');

my $seq3 = $seqin->next_seq();

my $seqin2 = Bio::SeqIO->newFh ( -file   => 'seq.fasta',
                                 -format => 'fasta');

my $seq4 = <$seqin2>;

my $seqin3 = Bio::SeqIO->newFh ( -file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

use Bio::DB::SwissProt;

my $banque = new Bio::DB::SwissProt;
my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

use Bio::Index::Swissprot;

my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
my $seq7 = $inx->fetch('100K_RAT');


data: small_swiss.inx [data/small_swiss.inx]

2.2.3. A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure


Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components


Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?


in bioperl:

# PDB structures entries
# ($annotation is the annotation object of the sequence: $seq->annotation)
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}
print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4)

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5)

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure


Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature


Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing the output of programs. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure


Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb])


Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation of features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2]

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers])

In the following example, CDS (feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)
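The relation between a primary tag and its secondary tags can be sketched with plain Perl data structures. This is only an illustration of the idea, not the bioperl implementation: the hash layout and the helper add_tag_value below are assumptions of the sketch (the bioperl feature classes provide their own methods of that kind). A feature has one primary tag, and each secondary tag holds a list of values, since a qualifier such as codon may appear several times.

```perl
use strict;
use warnings;

# A feature modelled as a plain hash: one primary tag (the feature key)
# and any number of secondary tags (qualifiers), each with a list of values.
my %feature = (
    primary_tag => 'CDS',
    start       => 1,
    end         => 1000,
    tags        => {},
);

# Hypothetical helper: append one value to the list kept for a tag.
sub add_tag_value {
    my ($feat, $tag, $value) = @_;
    push @{ $feat->{tags}{$tag} }, $value;
    return;
}

add_tag_value(\%feature, 'codon', '(seq:"cug",aa:Ser)');
add_tag_value(\%feature, 'codon', '(seq:"tga",aa:Trp)');

# Print the feature in a feature-table-like layout.
print "$feature{primary_tag} $feature{start}..$feature{end}\n";
for my $tag (sort keys %{ $feature{tags} }) {
    print "/$tag=$_\n" for @{ $feature{tags}{$tag} };
}
```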

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.


Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated to features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use to specify the begin or the end: fuzzy locations, ranges, lists (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..': EXACT

• '^': BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.': WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive

• '<': BEFORE

• '>': AFTER
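The coding above can be illustrated by a small classifier written in plain Perl. It is a sketch only: the function position_type is hypothetical, it looks at a single start or end position written as a string, and it does not reproduce the parsing done inside Bio::Location::Fuzzy.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one start or end position string
# according to the coding listed above.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;              # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;              # >90
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;          # 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;    # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;               # 90
    return 'UNKNOWN';
}

print "$_ => ", position_type($_), "\n"
    for '<30', '90', '123^124', '(102.110)', '>200';
```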

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy
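A coordinate policy reduces a fuzzy start or end, which is known only as a range, to a single number. The sketch below shows the idea in plain Perl for two of the policies. It assumes that the widest policy picks the outermost bounds and the narrowest policy the innermost ones; the hash keys and the function names are hypothetical and are not the bioperl interface.

```perl
use strict;
use warnings;

# Each policy maps the possible range of the start and of the end to one
# number. An undefined bound falls back to the other bound of the same side.
sub widest {
    my ($loc) = @_;
    my $start = defined $loc->{min_start} ? $loc->{min_start} : $loc->{max_start};
    my $end   = defined $loc->{max_end}   ? $loc->{max_end}   : $loc->{min_end};
    return ($start, $end);
}

sub narrowest {
    my ($loc) = @_;
    my $start = defined $loc->{max_start} ? $loc->{max_start} : $loc->{min_start};
    my $end   = defined $loc->{min_end}   ? $loc->{min_end}   : $loc->{max_end};
    return ($start, $end);
}

# A location whose start lies within 10-20 and whose end lies within 80-90.
my %loc = (min_start => 10, max_start => 20, min_end => 80, max_end => 90);
printf "widest:    %d..%d\n", widest(\%loc);
printf "narrowest: %d..%d\n", narrowest(\%loc);
```

The choice of policy matters when fuzzy features are turned into sub-sequences: the widest range never cuts off part of the feature, while the narrowest range contains only positions that certainly belong to it.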


Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations, line by line, with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.


Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;              # sequence as string
$seqobj->subseq(10,40);    # sub-sequence as string
$seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;
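Sub-sequence coordinates are 1-based and inclusive, which differs from the 0-based offset and length used by the Perl substr function. The following sketch shows the conversion on a plain string; the function subseq defined here is a hypothetical stand-in written for this illustration, not the bioperl method.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sub-sequence extraction with 1-based,
# inclusive coordinates.
sub subseq {
    my ($seq, $start, $end) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print subseq($dna, 1, 3), "\n";    # first codon
print subseq($dna, 4, 9), "\n";    # second and third codons
```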

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);
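What a monomer count computes can be shown on a plain string. The sketch below is plain Perl with a hypothetical function count_monomers of its own; it returns a reference to a hash that maps each residue to its number of occurrences, which is an assumption about a convenient result shape rather than a description of the bioperl method.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a monomer count: tally each residue of a
# sequence string, ignoring case.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

my $count = count_monomers('ATGAGTAGTAGTAAAGGTA');
print "$_: $count->{$_}\n" for sort keys %$count;
```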

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AATAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRV--GAT^ATC',    # creates a new one
                                          -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',    # not a Bio::Seq object
                                               '-threshold' => '3.5',
                                               '-desc'      => 'test sigcleave protein seq',
                                               '-type'      => 'AMINO ');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;


Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed like the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;


AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();


SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all
# columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
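The three steps of this example (transpose the rows into columns, drop the columns that contain a gap, transpose back) do not depend on bioperl and can be tried on plain strings. The following is a minimal sketch; the function remove_gap_columns is hypothetical and assumes that all aligned strings have the same length.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column that contains the gap character.
# Takes a gap character and a list of equal-length aligned strings,
# returns the filtered strings in the same order.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # 1. transpose the rows into column strings
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ for split //, $row;
    }

    # 2. keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # 3. transpose back into rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ for split //, $col;
    }
    return @out;
}

print "$_\n" for remove_gap_columns('-', 'AC-GT', 'A-TGT', 'ACTG-');
```

Using index instead of a regular expression avoids having to quote the gap character, which is what \Q and \E do in the bioperl version.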

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta]

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9


Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html])

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html])

    • print some hit statistics


Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                     'database'  => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file:

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast Classes diagram


Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14


Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out])

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;
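The hash technique suggested in this exercise can be sketched on plain hit names, without parsing a report. In the sketch below, the dbname function is a hypothetical stand-in for the one linked above: it assumes hit names of the form database|accession|name. The hit names in the example call are made up for the illustration, and the hits are assumed to be listed best first, as in a Blast report.

```perl
use strict;
use warnings;

# Hypothetical stand-in for dbname: the database name is taken to be the
# first field of a hit name written "database|accession|name".
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Keep the first (best) hit seen for each database.
sub best_hit_by_db {
    my @hit_names = @_;
    my (%done, @best);
    for my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};    # a better hit was already kept
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, ordered from best to worst.
print "$_\n" for best_hit_by_db(
    'sp|X00001|HIT_A', 'sp|X00002|HIT_B', 'pdb|Y001|C', 'pdb|Y002|D',
);
```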

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20


Exercise 416 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21

413 BioToolsBPlite family parsers

The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)

Example 44 Parsing with BPLite

In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print informations from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query = <$query_in>;

$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);                        # number of iterations
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration $iteration\n";
    my $result = $blast_report->round($iteration);
    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        # loop over every HSP of the hit
        while( my $hsp = $hit->nextHSP ) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display the Genscan predicted CDS (only present if Genscan was run with -cds)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n" if defined $predicted_cds;

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding: ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron - primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr - primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which level of the class hierarchy?


Chapter 5 Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29


Chapter 6 Perl Reminders

These reminders may help you to use the modules [use] or to gain a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent, with a named array variable, to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " $a{a} $a{b} \n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
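The reference forms listed in this section can be tried without bioperl. The following is a minimal sketch in plain Perl; the names run_search, my_filter, %run_param and @results are hypothetical and do not belong to any bioperl module.

```perl
use strict;
use warnings;

# Hypothetical function (not a bioperl API): it receives a hash reference,
# a code reference used as a filter, and an array reference that collects
# the results (an "output parameter").
sub run_search {
    my (%args) = @_;
    my $param  = $args{-run};          # reference to a hash
    my $filter = $args{-filt_func};    # reference to a function
    my $saved  = $args{-save_array};   # reference to an array
    foreach my $seq (@{ $param->{-seqs} }) {
        push @$saved, $seq if $filter->($seq);
    }
    return scalar @$saved;
}

# Keep only the sequences of at least 4 characters.
sub my_filter { return length($_[0]) >= 4 }

my %run_param = ( -prog => 'blastp',
                  -seqs => [ 'ACGT', 'AC', 'ACGTAC' ] );  # anonymous array
my @results;

my $count = run_search( -run        => \%run_param,
                        -filt_func  => \&my_filter,
                        -save_array => \@results );

print "kept $count: @results\n";
print ref(\%run_param), " ", ref(\&my_filter), " ", ref(\@results), "\n";
```

The caller's @results array is filled in place because the function only ever sees a reference to it, which is what the -save_array parameter above relies on.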

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;                  # read a line
print OUTFILE $line;               # write a line

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
$sequence = <$in>;                             # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
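As an illustration of passing a filehandle through a typeglob reference, here is a small sketch in plain Perl. The helper write_lines is hypothetical, and an in-memory file is used instead of a real one only so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: writes each record on its own line to whatever
# filehandle it is given (here always a typeglob reference).
sub write_lines {
    my ($fh, @records) = @_;
    print $fh "$_\n" foreach @records;
    return scalar @records;
}

# A bareword filehandle such as STDOUT is passed as \*STDOUT.
write_lines(\*STDOUT, 'first', 'second');

# The same function works with a bareword handle opened on an
# in-memory string, which lets us look at what was written.
my $buffer = '';
open(OUT, '>', \$buffer) || die "cannot open in-memory file: $!";
my $written = write_lines(\*OUT, 'seq1', 'seq2', 'seq3');
close(OUT);

# Read the lines back through another bareword filehandle.
open(IN, '<', \$buffer) || die "cannot reopen in-memory file: $!";
my @lines = <IN>;
close(IN);
chomp @lines;
print "read back: @lines\n";
```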

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception\n";
} else {
    print "no exception\n";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
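The mechanisms of this section (raising with die, catching with eval, and abstract methods that only throw) can be combined in a small standalone sketch. All the class names below (My::SeqI, My::Seq, My::Broken) and the function safe_seq are hypothetical; this is not bioperl code.

```perl
use strict;
use warnings;

package My::SeqI;        # the "interface" class
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub throw {
    my ($self, $msg) = @_;
    die "-------- EXCEPTION --------\nMSG: $msg\n";
}
# Abstract method: the interface version only raises an exception.
sub seq {
    my ($self) = @_;
    $self->throw("My::SeqI::seq - implementing class did not provide this method");
}

package My::Seq;         # a concrete class that implements seq()
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq} }

package My::Broken;      # a subclass that forgot to implement seq()
our @ISA = ('My::SeqI');

package main;

# Returns the sequence, or the string 'caught' if an exception was raised.
sub safe_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq() };    # note the ; after the eval block
    if ($@) {
        return 'caught';
    }
    return $value;
}

print safe_seq(My::Seq->new(seq => 'ACGT')), "\n";
print safe_seq(My::Broken->new()), "\n";
```

The second call reaches the inherited My::SeqI::seq, which dies; the eval block turns that into a value the caller can test.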


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
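To experiment with getopts without a real command line, the argument list can be supplied by the program itself. The sketch below assumes a hypothetical helper parse_options and the option letters h, i and o; a letter followed by ':' in the specification takes a value, the others are simple flags.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parses a list of arguments as if it were the
# command line and returns the formats, the help flag and what is left.
sub parse_options {
    local @ARGV = @_;                 # getopts() works on @ARGV
    my %opts = ();
    getopts('hi:o:', \%opts) or die "bad options\n";
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';
    my $help    = exists $opts{'h'} ? 1 : 0;
    return ($inform, $outform, $help, [@ARGV]);   # @ARGV now holds the rest
}

my ($in, $out, $help, $rest) = parse_options('-o', 'gcg', 'seqs.sp');
print "in=$in out=$out help=$help rest=@$rest\n";
```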

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component.

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
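The following sketch puts @ISA, isa and ref together on a tiny hypothetical hierarchy (My::Feature and My::Exon are invented for the illustration and are not bioperl classes).

```perl
use strict;
use warnings;

package My::Feature;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub primary_tag { return $_[0]->{tag} }

package My::Exon;
our @ISA = qw(My::Feature);            # My::Exon inherits from My::Feature
sub is_coding { return 1 }

package main;

my $exon = My::Exon->new(tag => 'exon');   # new() is inherited

print "class of exon: ", ref($exon), "\n";
print "object is-a My::Feature\n" if $exon->isa('My::Feature');
print "class is-a My::Feature\n"  if My::Exon->isa('My::Feature');
print "tag: ", $exon->primary_tag, "\n";   # inherited method
```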

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which uses one to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
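A short sketch of the execution order, in plain Perl: the BEGIN block runs as soon as it has been compiled, before any ordinary statement of the file. The variable names @trace and $policy, and the function default_policy, are hypothetical.

```perl
use strict;
use warnings;

our @trace;      # records the order in which the statements run
our $policy;     # a package-level default, set at load time

push @trace, 'run time';

BEGIN {
    # Executed during compilation, hence before the push above.
    $policy = 'widest';
    push @trace, 'BEGIN';
}

sub default_policy { return $policy }

print join(' -> ', @trace), "\n";
print "default policy: ", default_policy(), "\n";
```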

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
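The three Exporter variables can be seen at work in the sketch below. The module My::Tools and its symbols are hypothetical; because the module is defined inside the script rather than in its own file, it is set up in a BEGIN block and import() is called directly, which is an assumption of this example and not the usual layout.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;
    require Exporter;
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(revcomp);                  # exported by default
    our @EXPORT_OK   = qw(gc_count $Default);        # imported on demand
    our %EXPORT_TAGS = ( obj => [ qw($Default) ] );  # a named set of symbols
    our $Default     = 'ACGT';

    sub revcomp  { my $s = reverse $_[0]; $s =~ tr/ACGT/TGCA/; return $s }
    sub gc_count { my $s = shift; return ($s =~ tr/GC//) }
}

# With a real My/Tools.pm file this line would read:
#   use My::Tools qw(:DEFAULT gc_count :obj);
BEGIN { My::Tools->import(qw(:DEFAULT gc_count :obj)) }

print revcomp('AACG'), "\n";      # default export
print gc_count('GGCAT'), "\n";    # imported on demand
print "$Default\n";               # imported through the :obj tag
```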

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.
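A minimal sketch of both pragmas, in plain Perl. The function is_valid_type is hypothetical; the string eval is only there to show, without stopping the script, that strict rejects an undeclared variable at compile time.

```perl
use strict;
use vars qw(@ISA %valid_type);   # pre-declared package globals

%valid_type = map { $_ => 1 } qw(dna rna protein);

sub is_valid_type {
    my ($type) = @_;
    return exists $valid_type{$type} ? 1 : 0;
}

# $undeclared has no "my", "our" or "use vars" declaration, so the
# evaluated code does not compile and eval returns undef.
my $ok = eval q{ use strict; $undeclared = 1; 1 };

print "undeclared variable rejected\n" unless $ok;
print "dna is a valid type\n" if is_valid_type('dna');
```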

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;              # make a tied filehandle
$sequence = <$fh>;           # read a sequence object (calls READLINE)
print $fh $sequence;         # write a sequence object (calls PRINT)
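The same tying technique can be reproduced on a toy class. In this sketch My::Stream and My::Stream::Fh are hypothetical classes that keep their records in memory; only Symbol::gensym and tie are real Perl facilities.

```perl
use strict;
use warnings;
use Symbol ();

package My::Stream;      # a stream of records kept in memory

sub new {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}
sub next_record  { my $self = shift; return shift @{ $self->{records} } }
sub write_record { my $self = shift; push @{ $self->{records} }, @_; return 1 }

# Create an anonymous glob and tie it to the helper class.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, 'My::Stream::Fh', $self;
    return $s;
}

package My::Stream::Fh;  # the class behind the tied filehandle

sub TIEHANDLE {
    my ($class, $stream) = @_;
    return bless { stream => $stream }, $class;
}
sub READLINE {
    my $self = shift;
    return $self->{stream}->next_record() unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_record());
    return @list;
}
sub PRINT { my $self = shift; return $self->{stream}->write_record(@_) }

package main;

my $stream = My::Stream->new('seq1', 'seq2');
my $fh     = $stream->fh;

my $first = <$fh>;       # calls READLINE in scalar context
print $fh 'seq3';        # calls PRINT
my @rest  = <$fh>;       # calls READLINE in list context

print "first=$first rest=@rest\n";
```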


Appendix A Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;
use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;
use Bio::SeqIO;

my $inform  = shift(@ARGV) || 'swiss';
my $outform = shift(@ARGV) || 'fasta';

my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
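The gap-counting step of this solution can be tested on plain strings, without an alignment file. The sketch below uses a hypothetical helper gap_margins and made-up rows; substr plays the role that slice plays on the alignment object.

```perl
use strict;
use warnings;

# Returns the largest number of leading and of trailing gap characters
# found in a list of aligned strings (all assumed to have equal length).
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

my @rows = ( '--ACGTAC-',
             'TTACGTACG',
             '-TACGT---' );
my ($start, $end) = gap_margins('-', @rows);

# Keep only the columns in which no row has a leading or trailing gap.
my @trimmed = map { substr($_, $start, length($_) - $start - $end) } @rows;
print "$start $end @trimmed\n";
```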

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines above)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank prefix from a hit name of the form db|id|...
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
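The "first hit per databank" logic of this solution does not depend on bioperl and can be checked on a plain list of names. In the sketch below the hit names are made up, and the name format "db|identifier|..." is an assumption.

```perl
use strict;
use warnings;

# Extracts the databank prefix from a hit name such as "sp|P11111|A_ECOLI";
# returns undef when the name does not have the assumed form.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hit names in report order (best hits first); invented for the example.
my @hits = ( 'sp|P11111|A_ECOLI', 'sp|P22222|B_ECOLI',
             'pdb|1ABC|A',        'sp|P33333|C_ECOLI', 'pdb|2XYZ|B' );

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    # skip unparsable names and every hit after the first one per databank
    next if !defined $db || $done{$db}++;
    push @best, $hit;
}
print "$_\n" foreach @best;
```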

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
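The padding done by fill_gaps can also be written with Perl's repetition operator x. The sketch below is an equivalent standalone version, with one added assumption: a fragment that reaches past the end of the alignment gets no right padding instead of a negative count.

```perl
use strict;
use warnings;

# Pads a fragment with gap characters so that it starts at column
# $start (1-based) of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # fragment reaches past the end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

With these arguments the fragment occupies columns 3 to 6, so two gaps are added on the left and four on the right.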

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the best hit is recorded
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length,
                         $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length,
                         $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    # hits that appear for the first time in this round
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }

    # hits seen in earlier rounds that are absent from this one
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) { next; }
        print "DISAPPEARED hit name: $hit\n";
    }

    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
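The bookkeeping in this solution compares the hit names of one round with those of the preceding rounds. A minimal sketch of that comparison for two rounds, written in plain Perl with hash lookups instead of repeated grep calls, could look as follows. The function name compare_rounds and the hit names are invented for the illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive rounds.
# Returns two array references: the hits that are new in the current
# round and the hits that disappeared, each sorted alphabetically.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    @new         = sort @new;
    @disappeared = sort @disappeared;
    return (\@new, \@disappeared);
}

# hypothetical hit names, for illustration only
my @round1 = qw(P1 P2 P3);
my @round2 = qw(P2 P3 P4);
my ($new, $gone) = compare_rounds(\@round1, \@round2);
print "NEW: @$new\n";
print "DISAPPEARED: @$gone\n";
```

Hash lookups keep the comparison linear in the number of hits, which matters little for a handful of iterations but avoids matching one hit name as a substring of another.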

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report  = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#!/local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;  # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#!/usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#!/usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
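To see why an index makes retrieval by identifier fast, it can help to build a toy one by hand. The sketch below is plain Perl and does not reproduce how Bio::Index::EMBL stores its data: it only records, for every entry of an EMBL-like flatfile, the byte offset of its ID line, then seeks directly to that offset. The function names and the two in-memory entries are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Build a hash: entry id => byte offset of its "ID" line.
sub build_index {
    my ($fh) = @_;
    my %offset;
    my $pos = tell($fh);
    while (my $line = <$fh>) {
        $offset{$1} = $pos if $line =~ /^ID\s+(\S+)/;
        $pos = tell($fh);
    }
    return \%offset;
}

# Read one entry, from its "ID" line up to and including the "//" line.
# Returns undef when the id is not in the index.
sub fetch_entry {
    my ($fh, $index, $id) = @_;
    return undef unless exists $index->{$id};
    seek($fh, $index->{$id}, 0);
    my $entry = '';
    while (my $line = <$fh>) {
        $entry .= $line;
        last if $line =~ m{^//};
    }
    return $entry;
}

# Two toy entries held in memory (hypothetical data)
my $flatfile = "ID   CDS_1\nSQ   acgt\n//\nID   CDS_2\nSQ   ttga\n//\n";
open(my $fh, '<', \$flatfile) or die "cannot open in-memory file: $!";
my $index = build_index($fh);
print fetch_entry($fh, $index, 'CDS_2');
```

The flatfile is read once to build the index; afterwards each lookup costs one seek, whatever the size of the database.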

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#!/local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift the exon coordinates into the coordinates of the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation',
                                $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local database name => format of its entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # the entry name has the form db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ''; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from a pipe opened on the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;
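The validation steps of get_Seq can be tried without bioperl or the golden program. The sketch below is plain Perl and isolates them: it splits an entry name into database and identifier, assuming the colon-separated db:id form that golden uses, and looks up the sequence format of the database. The function name parse_entry, the use of die in place of throw, and the sample entry names are assumptions made for the illustration.

```perl
use strict;
use warnings;

# database name => sequence format, mirroring the %AVAIL_LOCALDB table
my %FORMAT_OF = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split a "db:id" entry name and look up the format of its database.
# Dies with a message when the entry name cannot be used.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $FORMAT_OF{$db};
    return ($db, $id, $FORMAT_OF{$db});
}

# hypothetical entry names
my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";

# eval traps the error raised for an unknown database
my @result = eval { parse_entry('pdb:1ABC') };
print "error: $@" if $@;
```

Checking the entry name before opening the pipe gives a clear message early, instead of an empty stream from a failed external command.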

Page 3: Bio Perl


List of Figures

2.1 Bio::Seq class structure
2.2 Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3 Bio::Annotation package structure
2.4 Bioperl 'Sequence' classes structure
2.5 Features Classes structure
2.6 Correspondence between an EMBL entry and bioperl tags
2.7 Location Classes Structure
2.8 Graphical view of some features of the SwissProt entry BACR_HALHA
3.1 AlignIO Classes diagram
3.2 Align Classes diagram
4.1 Blast Classes diagram
4.2 BPLite Classes diagram
4.3 Blast internal classes diagram
4.4 Genscan Classes Structure
5.1 Database Classes structure
6.1 UML meanings
A.1 Bio::SeqIO structure

List of Examples

2.1 SwissProt -> Fasta
2.2 Loading a sequence from a remote server
2.3 Find the references to the PDB database entries
3.1 Format conversions with AlignIO
3.2 Basic methods of SimpleAlign
3.3 Filter gap columns
4.1 StandAloneBlast run
4.2 StandAloneBlast parsing
4.3 Parsing from a Blast file
4.4 Parsing with BPLite
4.5 Running PSI-blast
4.6 PSI-Blast SearchIO class
4.7 Running bl2seq
4.8 Genscan parsing
4.9 Genscan parsing with sub-sequences
5.1 Database class use
5.2 Database Index creation

List of Exercises

2.1 Bio::SeqIO
2.2 A universal converter
2.3 Display a sequence in fasta format
2.4 More on annotations
2.5 Transmembrane helices
2.6 extractcds
2.7 extractcds version 2
3.1 Create an alignment without gaps
4.1 Running Blast on a Swissprot entry
4.2 Running Blast: Setting parameters
4.3 Running Blast: Saving output
4.4 Running a Remote Blast
4.5 Display Blast hits
4.6 Class of a Blast report
4.7 Parse a Blast output file
4.8 Parse Blast results on standard input
4.9 Filtering hits by length
4.10 Filtering hits by position
4.11 Display the best hit by databank
4.12 Multiple queries
4.13 Extracting the subject sequence
4.14 Extracting alignments
4.15 Locate EST in a genome
4.16 Record Blast hits as sequence features
4.17 Print information from a BPLite report
4.18 Create a Bio::Tools::BPlite from a file
4.19 Create a Bio::Tools::BPlite from standard input
4.20 Parse a PSI-blast report
4.21 Build a Bio::SimpleAlign object
4.22 Code reading: Bio::Tools::Genscan module
5.1 Parse a Genscan report and build a database entry
5.2 Parse a Genscan report and build a database entry with the genomic sequence
5.3 Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1 Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2 General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from input stream

  • write it to output stream

in bioperl:

#!/local/bin/perl -w

use strict;

use Bio::SeqIO;

my $in  = Bio::SeqIO->newFh ( -file   => '<seqs.sp',
                              -format => 'swiss' );

my $out = Bio::SeqIO->newFh ( -file   => '>seqs.fasta',
                              -format => 'fasta' );

print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

use Bio::DB::SwissProt;

my $database = new Bio::DB::SwissProt;
my $seq = $database->get_Seq_by_id('BACR_HALHA');

print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo

use Bio::Seq;

my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                           -id   => 'my seq',
                           -desc => 'this is a new Seq');

from a file

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new ( -file   => 'seq.fasta',
                              -format => 'fasta');
my $seq3 = $seqin->next_seq();

my $seqin2 = Bio::SeqIO->newFh ( -file   => 'seq.fasta',
                                 -format => 'fasta');
my $seq4 = <$seqin2>;

my $seqin3 = Bio::SeqIO->newFh ( -file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
my $seq5 = <$seqin3>;

by fetching an entry from a remote database

use Bio::DB::SwissProt;

my $banque = new Bio::DB::SwissProt;
my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database

use Bio::Index::Swissprot;
my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are shown in more detail in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing the output of programs. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code, for instance with gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation of features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (feature key) is a primary tag and codon a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
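One way to picture the tag system is as a small data structure: a feature carries one primary tag and, for each secondary tag, a list of values. The sketch below models this in plain Perl. It is not the Bio::SeqFeature::Generic implementation: the hash layout, the helper names add_value and values_of, and the extra 'source' qualifier are assumptions made for the illustration.

```perl
use strict;
use warnings;

# A feature reduced to a plain hash: one primary tag, plus secondary
# tags that each hold a list of values.
my %feature = (
    primary_tag => 'CDS',
    start       => 1,
    end         => 1000,
    tags        => {},
);

# Append a value to a secondary tag, creating the tag when needed.
sub add_value {
    my ($feat, $tag, $value) = @_;
    push @{ $feat->{tags}{$tag} }, $value;
    return;
}

# Return all values recorded for a secondary tag (empty list if none).
sub values_of {
    my ($feat, $tag) = @_;
    return @{ $feat->{tags}{$tag} || [] };
}

add_value(\%feature, 'codon', '(seq:"cug",aa:Ser)');
add_value(\%feature, 'codon', '(seq:"tga",aa:Trp)');
add_value(\%feature, 'source', 'my_program');   # a new, user-defined qualifier

for my $tag (sort keys %{ $feature{tags} }) {
    print "$feature{primary_tag} /$tag=$_\n" for values_of(\%feature, $tag);
}
```

Because a secondary tag maps to a list, the same qualifier can occur several times on one feature, as codon does in the CDS example.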

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated to features, sequences belonging to an alignment.

Coordinates types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases); example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range); example: (102.110) is a site within the 102..110 range, inclusive

• '<' BEFORE

• '>' AFTER
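The coding listed above can be made concrete with a few regular expressions. The following sketch, in plain Perl and independent of the Bio::Location classes, classifies a single position written in feature-table style. The function name classify_position and the sample positions are assumptions made for the illustration; it does not reproduce how bioperl parses locations.

```perl
use strict;
use warnings;

# Classify one position written in feature-table style.
# Returns (type, lowest coordinate, highest coordinate).
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  $1, $1) if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1, $1) if $pos =~ /^>(\d+)$/;
    return ('WITHIN',  $1, $2) if $pos =~ /^\((\d+)\.(\d+)\)$/;
    return ('BETWEEN', $1, $2) if $pos =~ /^(\d+)\^(\d+)$/;
    return ('EXACT',   $1, $1) if $pos =~ /^(\d+)$/;
    die "unrecognised position '$pos'\n";
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    my ($type, $min, $max) = classify_position($pos);
    print "$pos\t$type\t$min\t$max\n";
}
```

Returning both a lowest and a highest coordinate leaves open which single number to report for a fuzzy position, a choice that bioperl delegates to its coordinate policy classes.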

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line by using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules/].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;              # sequence as string
$seqobj->subseq(10,40);    # sub-sequence as string
$seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;
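The coordinates passed to subseq and trunc are 1-based and inclusive, whereas Perl's own substr counts from 0 and takes a length. The short sketch below, in plain Perl without bioperl, shows the conversion between the two conventions. The helper name subseq_1based is an assumption made for the illustration; the sequence string is the one used in Section 2.2.2.

```perl
use strict;
use warnings;

# Return the sub-sequence from $start to $end, both 1-based and inclusive.
sub subseq_1based {
    my ($sequence, $start, $end) = @_;
    die "invalid range $start-$end\n"
        unless $start >= 1 && $start <= $end && $end <= length($sequence);
    return substr($sequence, $start - 1, $end - $start + 1);
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print subseq_1based($dna, 1, 3), "\n";    # first codon
print subseq_1based($dna, 4, 9), "\n";    # second and third codons
```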

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);
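To get a feeling for the kind of statistics these two classes report, here is a minimal sketch in plain Perl that counts residues and words in a sequence string. It does not call bioperl, the function names are invented, and counting overlapping words is an assumption that may differ from what count_words does.

```perl
use strict;
use warnings;

# Count how often each residue occurs in a sequence string.
sub count_residues {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for split //, uc $sequence;
    return \%count;
}

# Count the words of a given length, moving one position at a time
# (overlapping words).
sub count_overlapping_words {
    my ($sequence, $word_length) = @_;
    my %count;
    for my $i (0 .. length($sequence) - $word_length) {
        $count{ uc substr($sequence, $i, $word_length) }++;
    }
    return \%count;
}

my $dna      = 'ATGAGTAGTAGTAAAGGTA';
my $residues = count_residues($dna);
print "$_: $residues->{$_}\n" for sort keys %$residues;

my $words = count_overlapping_words($dna, 3);
print "$_: $words->{$_}\n" for sort keys %$words;
```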

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AATAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                          -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                               '-threshold' => '3.5',
                                               '-desc'      => 'test sigcleave protein seq',
                                               '-type'      => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#!/local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all
# columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
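The column filtering of Example 3.3 can be tried on plain strings, without an alignment file or bioperl. The sketch below applies the same idea: find the columns that are free of gaps, then rebuild every row from those columns only. The function name remove_gap_columns and the three aligned rows are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character in at least one row.
# All rows are expected to have the same length.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;
    return () unless @rows;

    # keep the indices of the columns that are free of gaps
    my @keep;
    for my $col (0 .. length($rows[0]) - 1) {
        my $has_gap = grep { substr($_, $col, 1) eq $gapchar } @rows;
        push @keep, $col unless $has_gap;
    }

    # rebuild each row from the kept columns only
    return map {
        my $row = $_;
        join '', map { substr($row, $_, 1) } @keep;
    } @rows;
}

# hypothetical aligned rows
my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @filtered;
```

Working with column indices avoids building one string per column, at the price of one substr call per residue.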

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2). Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]). Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out). Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI. Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

28

Chapter 4 Analysis

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length. Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report (hint: use the ref() perl function)? Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that Bio::Tools::Run::StandAloneBlast returns from blastall (when _READMETHOD is set to "Blast") actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input. Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]). Solution A.14


Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname($hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
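The hash-marking hint can be tried in plain Perl, without Bioperl. This is only a sketch: the hit names are made up, and the db_of and best_by_db helpers are hypothetical stand-ins for the dbname function and the loop over $result->next_hit(). It assumes NCBI-style hit names of the form db|accession|id, listed from best to worst as in a Blast report.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first '|' of a hit name.
sub db_of {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

# Keep only the first (best) hit seen for each database.
sub best_by_db {
    my @names = @_;
    my %done;
    my @best;
    foreach my $name (@names) {
        my $db = db_of($name);
        next if $done{$db};      # this database was already handled
        $done{$db} = 1;          # mark it as done
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, ordered from best to worst.
my @hits = ('sp|P12345|AAA_HUMAN', 'pir|A00001|XYZ', 'sp|P99999|BBB_MOUSE');
print "$_\n" foreach best_by_db(@hits);
```

Because the report lists hits by decreasing significance, the first hit met for a database is also its best one, so a single pass is enough.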

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from the HSPs with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20


Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            # nextHSP returns one HSP per call, so loop until it is exhausted
            while( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            # print the hit name rather than the stringified object
            print "Hit: ", $hit->name(), "\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length
                       );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format. Solution A.26

4.1.6. Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use], or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);   # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);     # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
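The reference reminders above can be tried without Bioperl. The following is a minimal sketch in core Perl: select_keys, high_score and collect_squares are hypothetical names invented for the illustration, and the scores are made-up values.

```perl
use strict;
use warnings;

# A function that receives a hash reference and a code reference.
sub select_keys {
    my ($params, $filter) = @_;
    # %$params dereferences the hash, $filter->(...) calls the function.
    my @kept = grep { $filter->($params->{$_}) } keys %$params;
    return sort @kept;
}

my %scores = (hitA => 120, hitB => 35, hitC => 80);   # made-up values
sub high_score { return $_[0] >= 50 }

my @kept = select_keys(\%scores, \&high_score);
print "kept: @kept\n";

# Output parameter: the function fills an array owned by the caller.
sub collect_squares {
    my ($out, @values) = @_;
    push @$out, $_ * $_ foreach @values;
}

my @squares;
collect_squares(\@squares, 1, 2, 3);
print "squares: @squares\n";
```

Passing \%scores rather than %scores keeps the hash as one argument; without the backslash it would be flattened into a list of keys and values.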

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default to the STDIN and STDOUT descriptors. The filehandle is associated to the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read one line
    print OUTFILE $sequence;        # write to the file

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
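As a minimal sketch of passing filehandles around, the following core Perl code hands a typeglob reference and a lexical filehandle to the same function. The write_line function is a hypothetical name, and the in-memory file (opening a reference to a string) assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Write one line to whatever filehandle the caller passes in.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle has to be passed as a reference to its typeglob.
write_line(\*STDOUT, "to standard output");

# A lexical filehandle is an ordinary scalar and can be passed directly.
# Here it writes into an in-memory string instead of a real file.
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_line($out, "to a string");
close($out);
print "buffer holds: $buffer";
```

Lexical filehandles avoid the typeglob syntax altogether, which is why they are usually preferred in newer code; the \*STDOUT form remains necessary for the predefined handles.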

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
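The die and eval mechanism can be tried without Bioperl. In this sketch, checked_ratio and try_ratio are hypothetical functions: the first raises an exception with die (as throw does internally), the second catches it with eval and inspects $@.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on bad input.
sub checked_ratio {
    my ($num, $den) = @_;
    die "denominator must not be zero\n" if $den == 0;
    return $num / $den;
}

# Run the call inside eval and report whether it succeeded.
sub try_ratio {
    my ($num, $den) = @_;
    my $value = eval { checked_ratio($num, $den) };   # note the ';' after the block
    if ($@) {
        return "Caught exception: $@";
    }
    return "no exception, value = $value\n";
}

print try_ratio(6, 3);
print try_ratio(1, 0);
```

The program keeps running after the failed call: the exception only ends the eval block, and its message is left in $@.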

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
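The same pattern can be reproduced in a few lines of core Perl. The class names below (ShapeI, Square, Blob) are hypothetical and only illustrate the idea: the interface class provides a method that dies, so a subclass that forgets to override it fails as soon as the method is called.

```perl
use strict;
use warnings;

package ShapeI;          # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub area {
    my ($self) = @_;
    die ref($self) . " did not implement area()\n";   # "abstract" method
}

package Square;          # implements the interface
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;            # forgets to implement area()
our @ISA = ('ShapeI');

package main;

print "square area: ", Square->new(side => 3)->area(), "\n";
eval { Blob->new()->area() };
print "error: $@" if $@;
```

Unlike a compiled language, Perl reports the missing method only at run time, when the inherited area() is actually called.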

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
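To experiment with getopts without typing command lines, the argument list can be localized inside a function. This is a sketch only: parse_args is a hypothetical helper, and the option letters mirror the -i, -o and -h options used by the converter in the solutions.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse an argument list (instead of the real @ARGV) into a hash of settings.
sub parse_args {
    local @ARGV = @_;
    my %opts;
    # 'i:' and 'o:' take a value, 'h' is a plain flag.
    getopts('i:o:h', \%opts) or die "bad options\n";
    return {
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => exists $opts{h} ? 1 : 0,
        rest    => [@ARGV],              # arguments left after the options
    };
}

my $conf = parse_args('-o', 'embl', 'seqs.sp');
print "in=$conf->{inform} out=$conf->{outform} files=@{$conf->{rest}}\n";
```

getopts removes the options it recognizes from @ARGV, so whatever remains (here the file name) can be processed as ordinary arguments.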

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
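The isa and ref tests can be checked on a toy hierarchy. The Animal and Dog classes below are hypothetical and exist only for this sketch.

```perl
use strict;
use warnings;

package Animal;
sub new { my ($class) = @_; return bless {}, $class; }

package Dog;
our @ISA = ('Animal');     # Dog inherits from Animal

package main;

my $pet = Dog->new();
print "class of pet: ", ref($pet), "\n";
print "pet is an Animal\n"    if $pet->isa('Animal');
print "Dog is an Animal\n"    if Dog->isa('Animal');
print "Animal is not a Dog\n" unless Animal->isa('Dog');
```

isa works both on an object and on a class name, whereas ref only makes sense on an object (it returns an empty string for a plain string such as a class name).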

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which uses a BEGIN block to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything else is encapsulated in methods.
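A short sketch in core Perl shows when a BEGIN block runs. The @trace array is a hypothetical variable used only to record the order of execution.

```perl
use strict;
use warnings;

our @trace;                         # records the order of execution

push @trace, 'run time';            # executed when the program runs
BEGIN { push @trace, 'BEGIN'; }     # executed as soon as it is compiled

print join(' -> ', @trace), "\n";
```

Although the BEGIN block is written after the first push, it is executed first, while the file is still being compiled. This is how a module can set up its defaults before any of its methods is called.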

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

  imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
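The three Exporter variables can be seen at work in a single file. This is a sketch under stated assumptions: My::Stats is a hypothetical module, and because it lives in the same file as the main program, its import method is called explicitly instead of through a use statement.

```perl
use strict;
use warnings;

package My::Stats;                  # hypothetical module name
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = ('mean');                    # exported by default
our @EXPORT_OK   = ('minimum', 'maximum');      # exported on demand
our %EXPORT_TAGS = (range => ['minimum', 'maximum']);

sub mean    { my $sum = 0; $sum += $_ foreach @_; return @_ ? $sum / @_ : 0; }
sub minimum { my $m = shift; foreach (@_) { $m = $_ if $_ < $m } return $m; }
sub maximum { my $m = shift; foreach (@_) { $m = $_ if $_ > $m } return $m; }

package main;

# In a separate file this would read: use My::Stats qw(mean :range);
My::Stats->import(qw(mean :range));

print "mean=", mean(2, 4, 6), " min=", minimum(2, 4, 6), " max=", maximum(2, 4, 6), "\n";
```

When an explicit import list is given, the default @EXPORT list is no longer applied automatically, which is why mean is named next to the :range tag.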

6.3.2. Compiler instructions

• use strict; no symbolic references, no use of undeclared variables (local variables are declared with a my statement; global variables must be pre-declared or fully qualified), no barewords.

• use vars qw(@ISA %valid_type); a pragma to pre-declare global variables.
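The effect of both pragmas can be observed in a few lines of core Perl. The variable names ($counter, @history) and the bump function are hypothetical; the misspelled variable is compiled inside a string eval so that the error can be displayed without stopping the program.

```perl
use strict;
use warnings;
use vars qw($counter @history);     # pre-declare package (global) variables

sub bump {
    my ($step) = @_;                # 'my' declares a local (lexical) variable
    $counter += $step;              # allowed under strict: declared by 'use vars'
    push @history, $counter;
    return $counter;
}

$counter = 0;
bump(2);
bump(3);
print "counter=$counter history=@history\n";

# Under 'use strict', a misspelled variable is a compile-time error.
eval q{ use strict; $countr = 1; };
print "strict caught: $@" if $@;
```

Without use strict, the misspelled $countr would silently create a new global variable, which is the kind of mistake the pragma is meant to catch.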

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;         # make a tied filehandle
    $sequence = <$fh>;      # read a sequence object (calls READLINE)
    print $fh $sequence;    # write a sequence object (calls PRINT)
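The tie mechanism can be tried on a toy class, without Bioperl. In this sketch, UpperCaseOut is a hypothetical class: printing to the tied filehandle stores an upper-cased copy of the text in an array, and reading from it returns the stored items one by one.

```perl
use strict;
use warnings;

package UpperCaseOut;               # hypothetical class for a tied filehandle
sub TIEHANDLE { my ($class, $store) = @_; return bless { store => $store }, $class; }
sub PRINT     { my $self = shift; push @{ $self->{store} }, map { uc($_) } @_; return 1; }
sub READLINE  { my $self = shift; return shift @{ $self->{store} }; }

package main;
use Symbol;

my @lines;
my $fh = Symbol::gensym();          # anonymous glob, as in the fh method above
tie *$fh, 'UpperCaseOut', \@lines;

print $fh "atgc";                   # calls PRINT
print $fh "ggcc";
my $first = <$fh>;                  # calls READLINE
print "first=$first remaining=@lines\n";
```

The caller only sees an ordinary filehandle; the class behind it decides what reading and printing mean, which is how Bio::SeqIO lets you read and write whole sequence objects.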

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;
    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;

    # no output file given: derive its name from the input file
    if ( ! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() to avoid a warning on a missing option
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        # use the alignment's gap character at the end as well
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use these lines instead of the two lines above)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);    # Expectation value (E), default 10.0
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast"
                                                       );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                    '-data' => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # minimal hit length to display (default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;    # -1 means "no upper limit"

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        if ( $hsp->start() >= $min_start &&
             ( $max_end == -1 || ( $hsp->end() <= $max_end ) ) ) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank prefix from a hit name such as "db|accession|entry"
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

# databanks for which a hit has already been displayed
my %done;

while ( my $hit = $result->next_hit() ) {
    my $dbname = dbname($hit->name());
    if ( $done{$dbname} ) {
        next;
    }
    else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
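
The two key steps of this solution, extracting the databank prefix from a hit name and keeping only the first hit seen for each databank, can be tried without a Blast report. The following is a minimal sketch in plain Perl. It assumes hit names of the form db|accession|entry and hits sorted best first; best_hit_by_databank and the sample names are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Return the databank prefix of a hit name such as "sp|A00001|HIT_ONE".
# Returns undef when the name contains no "|" separator.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first hit seen for each databank (hypothetical helper).
sub best_hit_by_databank {
    my @names = @_;
    my %done;
    my @best;
    foreach my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;    # skip names without a databank prefix
        next if $done{$db}++;       # a hit of this databank was already kept
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, for illustration only.
my @hits = ('sp|A00001|HIT_ONE', 'sp|A00002|HIT_TWO', 'tr|B00001|HIT_THREE');
print "$_\n" foreach best_hit_by_databank(@hits);
```

Unlike the solution, this dbname returns undef instead of reusing a stale $1 when the name has no separator.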

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                              -format => 'fasta');
push(@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                              -format => 'fasta');
push(@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while ( my $result = $blast_report->next_result() ) {
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only the first (best) hit of each query
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while ( my $hit = $result->next_hit() ) {
    while ( my $hsp = $hit->next_hsp() ) {
        if ( $hsp->frac_identical() >= $level ) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while ( my $hit = $result->next_hit() ) {
    while ( my $hsp = $hit->next_hsp() ) {
        if ( $hsp->frac_identical() >= $level ) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    my $i = 0;
    while ( my $hsp = $hit->next_hsp() ) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
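
The padding done by fill_gaps can be checked on its own. The sketch below is a plain Perl variant that uses the repetition operator instead of loops; it assumes a 1-based start column, as in the solution, and clamps negative pad widths to zero (an addition not present in the original).

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left = $start - 1;
    $left = 0 if $left < 0;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # the sequence runs past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";    # prints --ACGT----
```

Here a 4-letter fragment placed at column 3 of a 10-column alignment gets 2 leading and 4 trailing gaps, so every row added to the alignment has the same length.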

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while ( my $sbjct = $report->nextSbjct ) {
        print STDERR "name: ", $sbjct->name, "\n";
        while ( my $hsp = $sbjct->nextHSP ) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the first subject of each database
    }
}

foreach my $feat ( $query->all_SeqFeatures() ) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while ( my $subject = $blast_report->nextSbjct() ) {
    print $subject->name(), "\n";
    while ( my $hsp = $subject->nextHSP() ) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while ( my $subject = $report->nextSbjct() ) {
    print $subject->name(), "\n";
    while ( my $hsp = $subject->nextHSP() ) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while ( my $subject = $report->nextSbjct() ) {
    print $subject->name(), "\n";
    while ( my $hsp = $subject->nextHSP() ) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

# hits seen in the previous iterations
my @previous_hitarray = ();

for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if ( grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    push(@previous_hitarray, @$oldhitarray_ref);
    push(@previous_hitarray, @$hitarray_ref);
}
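
The test for disappeared hits above uses grep with a regular expression, which also succeeds when one hit name is a substring of another. A hash lookup is an alternative when exact name comparison is wanted. The sketch below is plain Perl and does not use BPpsilite; disappeared_hits and the hit names are hypothetical.

```perl
use strict;
use warnings;

# Return the names from @$previous that are absent from @$current
# (exact string comparison, order of @$previous preserved).
sub disappeared_hits {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$current;
    return grep { !$seen{$_} } @$previous;
}

# Made-up hit names: HIT_A is a substring of HIT_AB.
my @round1 = ('HIT_A',  'HIT_AB', 'HIT_C');
my @round2 = ('HIT_AB', 'HIT_C',  'HIT_D');
print "DISAPPEARED hit name: $_\n" foreach disappeared_hits(\@round1, \@round2);
```

With these names the regular expression test would treat HIT_A as still present, because it matches inside HIT_AB, whereas the hash lookup reports it as disappeared.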

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                  -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                  -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while ( my $hsp = $report->next_feature ) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# date of construction
my $date = gmtime;

while ( my $gene = $genscan->next_prediction() ) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);

    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ( $gene->exons() ) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ( $gene->utrs() ) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if ( defined $seq ) {
        print $out $seq;
    }
    else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                            -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while ( my $gene = $genscan->next_prediction() ) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ( $gene->exons() ) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ( $start < 0 ) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ( $strand == -1 && $loc->strand == 1 ) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ( $loc->strand == -1 ) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ( $end > $seq->length() ) { $end = $seq->length(); }
    }

    if ( !$nocds ) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                   -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else                 { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                      -id      => $ids[1],
                                      -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # databank name => sequence format understood by Bio::SeqIO
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ( $access !~ /^-[ia]$/ ) { $access = ""; }

    if ( !$id ) { $self->throw("no database specified"); }

    if ( !_has_local_DB($db) ) {
        $self->throw("$db -- no such database available");
    }

    # read the output of the golden program through a pipe
    my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                             -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ( $? >> 8 ) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if ( !defined $seq );

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if ( exists $AVAIL_LOCALDB{$db} ) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;
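
The entry handling in get_Seq can be exercised without golden or bioperl. The following plain Perl sketch assumes that entries are written db:id with a colon as separator, and reuses the databank table of the module; parse_entry is a hypothetical helper, not a method of the module.

```perl
use strict;
use warnings;

# Databank name => SeqIO format, same table as %AVAIL_LOCALDB in the module.
my %format_of = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split "db:id" and look up the sequence format (hypothetical helper).
# Dies with a message when the databank is missing or unknown.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $format_of{$db};
    return ($db, $id, $format_of{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "db=$db id=$id format=$format\n";
```

Checking the entry before building the golden command line keeps an unknown databank from ever reaching the pipe.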


A. Solutions
A.1. Sequences
A.2. Alignments
A.3. Analysis
A.4. Databases

List of Figures

2.1. Bio::Seq class structure
2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3. Bio::Annotation package structure
2.4. Bioperl 'Sequence' classes structure
2.5. Features Classes structure
2.6. Correspondence between an EMBL entry and bioperl tags
2.7. Location Classes Structure
2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
3.1. AlignIO Classes diagram
3.2. Align Classes diagram
4.1. Blast Classes diagram
4.2. BPLite Classes diagram
4.3. Blast internal classes diagram
4.4. Genscan Classes Structure
5.1. Database Classes structure
6.1. UML meanings
A.1. Bio::SeqIO structure

List of Examples

2.1. SwissProt -> Fasta
2.2. Loading a sequence from a remote server
2.3. Find the references to the PDB database entries
3.1. Format conversions with AlignIO
3.2. Basic methods of SimpleAlign
3.3. Filter gap columns
4.1. StandAloneBlast run
4.2. StandAloneBlast parsing
4.3. Parsing from a Blast file
4.4. Parsing with BPLite
4.5. Running PSI-blast
4.6. PSI-Blast SearchIO class
4.7. Running bl2seq
4.8. Genscan parsing
4.9. Genscan parsing with sub-sequences
5.1. Database class use
5.2. Database Index creation

List of Exercises

2.1. Bio::SeqIO
2.2. A universal converter
2.3. Display a sequence in fasta format
2.4. More on annotations
2.5. Transmembrane helices
2.6. extractcds
2.7. extractcds version 2
3.1. Create an alignment without gaps
4.1. Running Blast on a Swissprot entry
4.2. Running Blast: Setting parameters
4.3. Running Blast: Saving output
4.4. Running a Remote Blast
4.5. Display Blast hits
4.6. Class of a Blast report
4.7. Parse a Blast output file
4.8. Parse Blast results on standard input
4.9. Filtering hits by length
4.10. Filtering hits by position
4.11. Display the best hit by databank
4.12. Multiple queries
4.13. Extracting the subject sequence
4.14. Extracting alignments
4.15. Locate EST in a genome
4.16. Record Blast hits as sequence features
4.17. Print information from a BPLite report
4.18. Create a Bio::Tools::BPlite from a file
4.19. Create a Bio::Tools::BPlite from standard input
4.20. Parse a PSI-blast report
4.21. Build a Bio::SimpleAlign object
4.22. Code reading: Bio::Tools::Genscan module
5.1. Parse a Genscan report and build a database entry
5.2. Parse a Genscan report and build a database entry with the genomic sequence
5.3. Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1. Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/

5. Unix help: man and perldoc

1.2. General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1. The Bio::SeqIO class

2.1.1. Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from input stream

  • write it to output stream

in bioperl:

#! /local/bin/perl -w

use strict;

use Bio::SeqIO;

my $in = Bio::SeqIO->newFh(-file   => '<seqs.sp',
                           -format => 'swiss');

my $out = Bio::SeqIO->newFh(-file   => '>seqs.fasta',
                            -format => 'fasta');

print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2. Sequence classes

2.2.1. Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its accession number and description

in bioperl:

use Bio::DB::SwissProt;

my $database = new Bio::DB::SwissProt;
my $seq = $database->get_Seq_by_id('BACR_HALHA');

print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2. Building mechanisms summary

de novo:

use Bio::Seq;

my $seq1 = Bio::Seq->new(-seq  => 'ATGAGTAGTAGTAAAGGTA',
                         -id   => 'my seq',
                         -desc => 'this is a new Seq');

from a file:

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new(-file   => 'seq.fasta',
                            -format => 'fasta');

my $seq3 = $seqin->next_seq();

my $seqin2 = Bio::SeqIO->newFh(-file   => 'seq.fasta',
                               -format => 'fasta');

my $seq4 = <$seqin2>;

# read the output of the golden program through a pipe
my $seqin3 = Bio::SeqIO->newFh(-file   => 'golden sp:TAUD_ECOLI |',
                               -format => 'swiss');
my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

use Bio::DB::SwissProt;

my $banque = new Bio::DB::SwissProt;
my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

use Bio::Index::Swissprot;
my $inx = Bio::Index::Swissprot->new(-filename => 'small_swiss.inx');
my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3. A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed further in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

# $annotation is the annotation object of the sequence, e.g. $seq->annotation()

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ( $link->database() eq 'PDB' ) {
        push(@structures, $link->primary_id());
    }
}
print "\nPDB Structures: ", join(" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments, if available (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features of an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use to specify the beginning or the end of a location: fuzzy locations, ranges, lists (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location that begins before 30 and ends at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.

• '<' BEFORE

• '>' AFTER
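
The notation above can be recognized with a few regular expressions. The sketch below is plain Perl and only classifies a position string by its notation; position_type is a hypothetical helper, not part of Bio::Location::Fuzzy, and a bare number is treated as EXACT by assumption.

```perl
use strict;
use warnings;

# Classify one end of a feature-table location by its notation
# (hypothetical helper, for illustration only).
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;               # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;               # >200
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;           # 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;     # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;                # 90
    return 'UNKNOWN';
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

In the Fuzzy example above, the start '<30' is therefore a BEFORE position and the end 90 an EXACT one.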

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;              # sequence as string
$seqobj->subseq(10,40);    # sub-sequence as string
$seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;
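
The coordinates passed to subseq and trunc are 1-based and inclusive, whereas Perl's substr is 0-based and takes a length. The plain Perl sketch below shows the conversion on a string; subseq_1based is a hypothetical helper written for this illustration and is not the bioperl implementation.

```perl
use strict;
use warnings;

# Return the sub-sequence from $start to $end, both 1-based and inclusive
# (hypothetical helper mimicking the coordinate convention of subseq).
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    die "bad coordinates\n"
        if $start < 1 || $end < $start || $end > length $seq;
    return substr($seq, $start - 1, $end - $start + 1);
}

print subseq_1based('ATGAGTAGTAGTAAAGGTA', 1, 3), "\n";    # prints ATG
print subseq_1based('ATGAGTAGTAGTAAAGGTA', 4, 9), "\n";    # prints AGTAGT
```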

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AA...TAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',    # creates a new one
                                         -MAKE => 'custom');
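
To see what cutting at a recognition site means, the idea can be sketched on a plain string. The code below is a toy model and not the implementation of cut_seq: it handles one strand of a linear sequence, and cut_at_site with its arguments (site string and cut offset inside the site) is hypothetical. EcoRI recognizes GAATTC and cuts after the G, hence the offset 1.

```perl
use strict;
use warnings;

# Cut $seq at every occurrence of $site; $offset is the number of bases
# of the site that stay on the left fragment (hypothetical helper).
sub cut_at_site {
    my ($seq, $site, $offset) = @_;
    my @fragments;
    my $prev = 0;    # start of the current fragment
    my $pos  = 0;    # where to resume the search
    while ( ( my $hit = index($seq, $site, $pos) ) >= 0 ) {
        my $cut = $hit + $offset;
        push @fragments, substr($seq, $prev, $cut - $prev);
        $prev = $cut;
        $pos  = $hit + 1;
    }
    push @fragments, substr($seq, $prev);    # remainder after the last cut
    return @fragments;
}

# Made-up sequence containing two GAATTC sites.
print join(' | ', cut_at_site('AAGAATTCTTGAATTCGG', 'GAATTC', 1)), "\n";
# prints AAG | AATTCTTG | AATTCGG
```

A sequence without any site comes back as a single fragment.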

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',    # not a Bio::Seq object
                                              '-threshold' => '3.5',
                                              '-desc'      => 'test sigcleave protein seq',
                                              '-type'      => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as the UnivAln class of the old bioperl release could. To filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
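The extract, filter and reconstruct steps can also be tried without bioperl. The following is only a minimal sketch in core Perl, assuming the aligned sequences are plain strings of equal length; the function name remove_gap_columns and the sample strings are illustrative and are not part of bioperl.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a list of
# equal-length aligned strings (plain strings stand in for sequence objects).
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # extract: transpose the rows into column strings
    my @cols;
    for my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ for split //, $row;
    }

    # filter: keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # reconstruct: transpose the kept columns back into rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $rownr = 0;
        $out[$rownr++] .= $_ for split //, $col;
    }
    return @out;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @filtered;
```

With real alignments the same three steps apply, but the strings come from and go back to the sequence objects, as in Example 3.3.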

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics


Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, whose documentation lists the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class, so you can create such an object directly from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast classes diagram

Figure 4.1. Blast classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits longer than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
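The hash-marking hint of Exercise 4.11 can be illustrated on plain data. This is only a sketch in core Perl: it assumes hit names of the form database|accession and a list already sorted from best to worst, and both the sample hit names and the small dbname function below are hypothetical stand-ins, not the course's dbname.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: take the database as the text before the first '|'.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Hypothetical hit names, assumed sorted from best to worst.
my @hits = ('sp|P12345', 'pir|A00001', 'sp|Q67890', 'gb|AAA00001', 'pir|B00002');

my %done;
my @best;
foreach my $hit (@hits) {
    my $database = dbname($hit);
    next if $done{$database};      # a better hit was already kept for this database
    $done{$database} = 1;          # mark this database as done
    push @best, $hit;
}

print "$_\n" for @best;
```

With a real report, the loop would run over the hits returned by next_hit and the name would come from $hit->name.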

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line, and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is still needed to blast two sequences and to perform a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers cannot deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}


Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite object from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite object from standard input.

Solution A.24

Figure 4.2. BPLite classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh=>\*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations:

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq => $hsp->qs,
        -id => $query1->display_id,
        -start => 1,
        -end => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq => $hsp->ss,
        -id => $report->sbjctName,
        -start => 1,
        -end => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag, " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?


Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • with the exons of this CDS stored as features of the entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, but they are not efficient for large databases (indexing time and index size). All these databases are already indexed, very efficiently, by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you use the modules [use] or understand them in more depth [advanced].

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run => \%runParam,
                                    -parse => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method => 'remote',
              -prog => 'blastp',
              -database => 'swissprot',
              -seqs => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method => 'remote',
              -prog => 'blastp',
              -database => 'swissprot',
              -seqs => \@seqs);     # Bio::Seq.pm objects

with a named array variable.

• references on anonymous data structures:

# (reminder) hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced data (or its class; see below).
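The three kinds of reference parameters listed above (hash, function, output array) can be tried together in core Perl. This is only an illustrative sketch: run_with_filter, long_enough and the -items key are made-up names and do not belong to any bioperl class.

```perl
use strict;
use warnings;

# Receives a hash reference (input parameters), a code reference (filter)
# and an array reference used as an output parameter.
sub run_with_filter {
    my ($params, $filter, $results) = @_;
    foreach my $item (@{ $params->{-items} }) {
        push @$results, $item if $filter->($item);
    }
    return scalar @$results;
}

sub long_enough { return length($_[0]) >= 4 }

my %run_param = ( -items => [ 'ACGT', 'AC', 'ACGTT' ] );   # anonymous array
my @kept;
my $count = run_with_filter(\%run_param, \&long_enough, \@kept);

print "kept $count: @kept\n";
print ref(\%run_param), " ", ref(\&long_enough), " ", ref(\@kept), "\n";
```

The function fills @kept through the reference, which is the same mechanism as the -save_array output parameter shown above.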

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a sequence

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
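Passing a typeglob reference to your own function works the same way as passing it to newFh. A minimal sketch in core Perl, assuming a hypothetical write_record function and an in-memory file standing in for a real output file:

```perl
use strict;
use warnings;

# Write one fasta-like record to whatever filehandle reference is passed in.
sub write_record {
    my ($fh, $record) = @_;
    print $fh ">$record->{id}\n$record->{seq}\n";
}

# An in-memory file stands in for a real output file here.
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record(\*OUTFILE, { id => 'seq1', seq => 'ACGT' });
close(OUTFILE);

print $buffer;

# The same call works with standard output.
write_record(\*STDOUT, { id => 'seq2', seq => 'TTGA' });
```

The bareword OUTFILE itself cannot be passed as an argument; \*OUTFILE is an ordinary scalar value, so it can.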

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.


• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self,$string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n".
              "MSG: ".$string."\n".$std."-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
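Both ideas, catching with eval and abstract methods that raise an exception, can be combined in a small self-contained sketch. The classes ShapeI, Square and Blob are hypothetical and use plain die rather than the bioperl throw method.

```perl
use strict;
use warnings;

package ShapeI;                      # hypothetical interface class
sub new  { my $class = shift; return bless {}, $class }
sub area {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide area()\n";
}

package Square;                      # implements the interface
our @ISA = ('ShapeI');
sub new  { my ($class, $side) = @_; return bless { side => $side }, $class }
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;                        # forgets to implement area()
our @ISA = ('ShapeI');

package main;

my @messages;
foreach my $shape (Square->new(3), Blob->new()) {
    my $area = eval { $shape->area() };   # the statement ends with ; after the block
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "area: $area\n";
    }
}
print @messages;
```

The call on the Blob object falls through to ShapeI::area, which raises the exception, so the missing method is reported at the first use rather than silently ignored.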


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
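To see what getopts leaves in the hash and in @ARGV, the first example can be run on a simulated command line. This is a sketch only; the argument values are made up.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, so the sketch runs without arguments.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.txt');

my %opts = ();
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file]\n";

# Options that were not given fall back to their default values.
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "$infile ($inform) -> $outfile ($outform); remaining: @ARGV\n";
```

A colon after a letter in the specification means that the option takes a value; arguments that are not options stay in @ARGV.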

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component.

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
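These class operations can be checked on two tiny classes. A minimal sketch, assuming hypothetical Feature and Exon classes that only mimic the shape of the bioperl ones:

```perl
use strict;
use warnings;

package Feature;                     # hypothetical base class
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub primary_tag { return $_[0]->{tag} }

package Exon;                        # hypothetical subclass
our @ISA = ('Feature');
sub is_coding { return 1 }

package main;

my $exon = Exon->new( tag => 'exon' );   # new is inherited from Feature

my @report = (
    'class of exon: ' . ref($exon),
    'Exon is-a Feature: ' . (Exon->isa('Feature') ? 'yes' : 'no'),
    'object is-a Feature: ' . ($exon->isa('Feature') ? 'yes' : 'no'),
    'primary_tag (inherited): ' . $exon->primary_tag(),
);
print "$_\n" for @report;
```

isa can be called on a class name or on an object, whereas ref only makes sense on an object.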

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything else is encapsulated in methods.
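The early execution of a BEGIN block can be observed directly. A small sketch in core Perl; the @order array is only there to record the order of execution.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time statement';

# Written after the push above, but executed first, at compile time.
BEGIN { push @order, 'BEGIN block'; }

print join(' -> ', @order), "\n";
```

The BEGIN block runs as soon as the compiler has read it, before any ordinary statement of the file is executed.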

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.

6.3.2. Compiler instructions

• use strict: no symbolic references, no undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s,$class,$self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;             # make a tied filehandle
$sequence = <$fh>;          # read a sequence object (calls READLINE)
print $fh $sequence;        # write a sequence object (calls PRINT)
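The tie mechanism can be tried without bioperl. This is a minimal sketch, assuming a hypothetical ListHandle class that serves items from a list instead of sequences from a file; only TIEHANDLE, READLINE and PRINT are defined, as in the Bio::SeqIO excerpt above.

```perl
use strict;
use warnings;
use Symbol ();

package ListHandle;                  # hypothetical tied-handle class

# TIEHANDLE is called by tie(); it builds the object behind the handle.
sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}

# READLINE backs <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    return splice @{ $self->{items} };
}

# PRINT backs "print $fh ...": here it just records what was written.
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'ListHandle', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first;           # calls PRINT

print "first=$first rest=@rest written=@{ $obj->{written} }\n";
```

tie returns the object created by TIEHANDLE, which is why the sketch can inspect what PRINT recorded.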

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # no output file given: replace the extension of the input file
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }

    return 1; # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print " -i informat -- infile format (default: 'swiss')\n";
    print " -o outformat -- outfile format (default: 'fasta')\n";
    print "\n";
    print " -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

$factory->e(1.0);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast"
                  );
    # keep the raw Blast output in a file
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast"
                  );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast"
                  );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast"
                  );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # note: despite its name, $max_length is used as a minimum hit length below
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;    # -1 means: no upper bound

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
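The position test of this solution can be isolated from the Blast objects. The sketch below is only an illustration: the function name in_window is hypothetical, and plain numbers stand in for the values returned by $hsp->start() and $hsp->end().

```perl
use strict;
use warnings;

# Hypothetical helper: true if an HSP spanning $start..$end passes the filter.
# A $max_end of -1 is the sentinel for "no upper bound", as in the solution.
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    return ($start >= $min_start
            && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

print in_window(10, 50, 5, -1), "\n";    # no upper bound given
print in_window(10, 50, 5, 40), "\n";    # HSP ends past the upper bound
```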

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank name = the word before the first '|' of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;    # databanks already reported
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
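The "first hit per databank" bookkeeping can be tried without a Blast report. In this sketch the hit names are made up for illustration, and the regular expression is anchored at the start of the name, which is an assumption about the name layout (databank|accession|name):

```perl
use strict;
use warnings;

# Extract the databank prefix (the word before the first '|') from a hit name.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Made-up hit names, in report order (best hit first).
my @hits = ('sp|X00001|AAA_TEST', 'sp|X00002|BBB_TEST', 'pdb|0XXX|A');

my %done;    # databanks already reported
my @best;
foreach my $name (@hits) {
    my $db = dbname($name);
    next if !defined $db || $done{$db}++;
    push @best, $name;
}
print "$_\n" for @best;
```

Because Blast reports list hits by decreasing significance, keeping the first hit seen for each databank keeps its best hit.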

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast"
                  );

    # blastall accepts a reference to an array of sequence objects
    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;    # only the best hit of each query
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
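The padding step of this solution can be checked on its own. The sketch below is a variant of fill_gaps written with Perl's x repetition operator; it is meant to be equivalent to the loops in the solution for start positions of 1 or more, and the test values are made up:

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at 1-based column $start of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $left  = 0 if $left  < 0;    # guard against out-of-range input
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
print length($padded), "\n";
```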

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in earlier iterations
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
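The comparison between two iterations amounts to a set difference on hit names. As a hedged sketch with made-up hit names, the hypothetical helper below uses a hash lookup; note that it matches names exactly, whereas the solution's grep with a regular expression also accepts substring matches:

```perl
use strict;
use warnings;

# Hypothetical helper: names present in @$previous but absent from @$current.
sub disappeared {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$current;
    return grep { !$seen{$_} } @$previous;
}

# Made-up hit names for two successive PSI-BLAST rounds.
my @round1 = qw(hitA hitB hitC);
my @round2 = qw(hitA hitC hitD);

my @lost = disappeared(\@round1, \@round2);
print "DISAPPEARED hit name: $_\n" for @lost;
```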

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates so that they are relative to the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (= join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                          -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databank name => sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as 'database:identifier'
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module file must end with a true value
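The argument checks of get_Seq can be exercised without the golden program or BioPerl. The following sketch is an illustration only: parse_entry is a hypothetical helper name, die stands in for $self->throw, and the databank table is copied from the module above.

```perl
use strict;
use warnings;

# Same databank => format table as in the module above.
my %AVAIL_LOCALDB = ( sprot    => 'swiss',
                      sptrnrdb => 'swiss',
                      gb       => 'genbank',
                      embl     => 'embl' );

# Hypothetical helper: split 'db:id' and return (db, id, format), or die.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = parse_entry('gb:AB042770');
print "$db $id $format\n";
```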


List of Figures

2.1. Bio::Seq class structure
2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3. Bio::Annotation package structure
2.4. Bioperl 'Sequence' classes structure
2.5. Features Classes structure
2.6. Correspondence between an EMBL entry and bioperl tags
2.7. Location Classes Structure
2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
3.1. AlignIO Classes diagram
3.2. Align Classes diagram
4.1. Blast Classes diagram
4.2. BPLite Classes diagram
4.3. Blast internal classes diagram
4.4. Genscan Classes Structure
5.1. Database Classes structure
6.1. UML meanings
A.1. Bio::SeqIO structure

List of Examples

2.1. SwissProt -> Fasta
2.2. Loading a sequence from a remote server
2.3. Find the references to the PDB database entries
3.1. Format conversions with AlignIO
3.2. Basic methods of SimpleAlign
3.3. Filter gap columns
4.1. StandAloneBlast run
4.2. StandAloneBlast parsing
4.3. Parsing from a Blast file
4.4. Parsing with BPLite
4.5. Running PSI-blast
4.6. PSI-Blast SearchIO class
4.7. Running bl2seq
4.8. Genscan parsing
4.9. Genscan parsing with sub-sequences
5.1. Database class use
5.2. Database Index creation

List of Exercises

2.1. Bio::SeqIO
2.2. A universal converter
2.3. Display a sequence in fasta format
2.4. More on annotations
2.5. Transmembrane helices
2.6. extractcds
2.7. extractcds, version 2
3.1. Create an alignment without gaps
4.1. Running Blast on a Swissprot entry
4.2. Running Blast: Setting parameters
4.3. Running Blast: Saving output
4.4. Running a Remote Blast
4.5. Display Blast hits
4.6. Class of a Blast report
4.7. Parse a Blast output file
4.8. Parse Blast results on standard input
4.9. Filtering hits by length
4.10. Filtering hits by position
4.11. Display the best hit by databank
4.12. Multiple queries
4.13. Extracting the subject sequence
4.14. Extracting alignments
4.15. Locate EST in a genome
4.16. Record Blast hits as sequence features
4.17. Print information from a BPLite report
4.18. Create a Bio::Tools::BPlite from a file
4.19. Create a Bio::Tools::BPlite from standard input
4.20. Parse a PSI-blast report
4.21. Build a Bio::SimpleAlign object
4.22. Code reading: Bio::Tools::Genscan module
5.1. Parse a Genscan report and build a database entry
5.2. Parse a Genscan report and build a database entry with the genomic sequence
5.3. Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1. Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2. General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1. The Bio::SeqIO class

2.1.1. Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt-formatted sequence input stream

• Construct a fasta-formatted sequence output stream

• for all sequences:

    • read the sequence from the input stream

    • write it to the output stream

in bioperl:

    #! /local/bin/perl -w

    use strict;

    use Bio::SeqIO;

    my $in  = Bio::SeqIO->newFh ( -file => '<seqs.sp', -format => 'swiss' );

    my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta', -format => 'fasta' );

    # each read from $in returns one sequence object; printing it to $out
    # writes it in fasta format
    print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2. Sequence classes

2.2.1. Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2. Building mechanisms summary

de novo:

    use Bio::Seq;

    my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                               -id   => 'my seq',
                               -desc => 'this is a new Seq');

from a file:

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta', -format => 'fasta');
    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta', -format => 'fasta');
    my $seq4 = <$seqin2>;

    # the "file" can also be a pipe from a program such as golden
    my $seqin3 = Bio::SeqIO->newFh ( -file   => 'golden sp:TAUD_ECOLI |',
                                     -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3. A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are shown in more detail in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

    # $seq is a Bio::Seq object, e.g. the one loaded in Example 2.2
    my $annotation = $seq->annotation();

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

    • Bio::SeqFeature::Gene::ExonI

    • Bio::SeqFeature::Gene::Exon

    • Bio::SeqFeature::Gene::Transcript

    • Bio::SeqFeature::Gene::TranscriptI

    • Bio::SeqFeature::Gene::GeneStructureI

    • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code, with for instance gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..': EXACT

• '^': BETWEEN (a site between 2 bases); example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.': WITHIN (a base within a range); example: (102.110) is a site within the 102 to 110 range, inclusive

• '<': BEFORE

• '>': AFTER
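To make the coding above concrete, here is a small sketch in plain Perl. It is not part of BioPerl: the function position_type is hypothetical and only recognizes the single-position notations listed above, returning the name of the corresponding coordinate type.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl method): name the coordinate type of one
# position written in the feature-table notation listed above.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;
    return 'AFTER'   if $pos =~ /^>\d+$/;
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;
    return 'WITHIN'  if $pos =~ /^\(\d+\.\d+\)$/;
    return 'EXACT'   if $pos =~ /^\d+$/;
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

In BioPerl itself this interpretation is done by the Bio::Location::Fuzzy class, so the sketch is only meant to illustrate the notation.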

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
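The coordinates passed to subseq are 1-based and inclusive. As a plain-Perl sketch of that convention (the helpers subseq_str and revcom_str are hypothetical stand-ins working on strings, not BioPerl methods), using the sequence string from the "de novo" example of Section 2.2.2:

```perl
use strict;
use warnings;

# 1-based, inclusive sub-sequence of a plain string, mirroring the
# coordinate convention of subseq().
sub subseq_str {
    my ($seq, $start, $end) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse complement of a DNA string (unambiguous bases only).
sub revcom_str {
    my ($seq) = @_;
    (my $rc = reverse $seq) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print subseq_str($dna, 4, 9), "\n";
print revcom_str('ATGC'), "\n";
```

With a real sequence object, $seqobj->subseq(4,9) and $seqobj->revcom play these roles; trunc returns an object rather than a string.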

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(
        -NAME => 'EcoRV--GAT^ATC',    # creates a new one
        -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave(
        '-file'      => 'sigtest.aa',    # not a Bio::Seq object
        '-threshold' => '3.5',
        '-desc'      => 'test sigcleave protein seq',
        '-type'      => 'AMINO');

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class, so format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure

Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
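
The same extract, filter and reconstruct steps can be traced on plain strings, without an alignment file. The following is a minimal sketch in core Perl only; strip_gap_columns is a hypothetical helper, not a bioperl method, and it assumes that all rows have the same length.

```perl
use strict;
use warnings;

# Remove every column that contains a gap from a list of aligned strings.
# $gap is the gap character; all rows are assumed to have the same length.
sub strip_gap_columns {
    my ( $gap, @rows ) = @_;

    # rows -> columns
    my @cols;
    foreach my $row (@rows) {
        my $i = 0;
        $cols[ $i++ ] .= $_ foreach split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index( $_, $gap ) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $i = 0;
        $out[ $i++ ] .= $_ foreach split //, $col;
    }
    return @out;
}

my @rows = strip_gap_columns( '-', 'AC-GT', 'A-CGT', 'ACCGT' );
print "$_\n" foreach @rows;
```

Here the second and third columns each contain a gap, so only the columns for A, G and T survive in every row.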

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation of the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast in the examples above actually belongs to the Bio::SearchIO class, so you can create such an object directly from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
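
The "already done" idiom from the hint can be sketched on plain strings. This is not the course solution: the dbname sub below is a hypothetical stand-in for the function mentioned above (it simply takes the first '|'-separated field), the hit names are invented sample data, and hits are assumed to arrive sorted from best to worst.

```perl
use strict;
use warnings;

# Hypothetical stand-in: take the database name to be the first
# '|'-separated field of the hit name, e.g. "sp" in "sp|P1|A".
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Keep the first hit seen for each database.
sub best_by_database {
    my @hit_names = @_;
    my ( %done, @best );
    foreach my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};     # this database already has its best hit
        $done{$database} = 1;         # mark the database as done
        push @best, $name;
    }
    return @best;
}

my @best = best_by_database( 'sp|P1|A', 'sp|P2|B', 'pir|S1|C', 'sp|P3|D', 'pir|S2|E' );
print "$_\n" foreach @best;
```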

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast over the databases given on the command line, and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'  => 'blastp',
                                                         'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for every HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram
4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'database' => 'swissprot' );
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while ( my $hit = $result->nextSbjct() ) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ( $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new( -fh => \*FH );

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );

    my $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );
    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ( $gene->exons() ) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ( $gene->introns() ) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ( $gene->utrs() ) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = ( a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [ 1, 2 ];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7.

ref() returns the type of the referenced variable (or its class, see below).
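
The three kinds of references listed above (hash, function, output array) can be combined in one small function call. The sketch below uses core Perl only; select_keys and is_big are hypothetical names chosen for the illustration.

```perl
use strict;
use warnings;

# Apply a filter function (code reference) to the values of a hash
# (hash reference) and store the selected keys in an output array
# (array reference). Returns the number of selected keys.
sub select_keys {
    my ( $params, $filter, $result ) = @_;
    foreach my $key ( sort keys %$params ) {
        push @$result, $key if $filter->( $params->{$key} );
    }
    return scalar @$result;
}

sub is_big { my ($value) = @_; return $value > 10 }

my %scores = ( a => 5, b => 20, c => 30 );
my @selected;
my $count = select_keys( \%scores, \&is_big, \@selected );

print "selected $count keys: @selected\n";
print ref( \%scores ), " ", ref( \&is_big ), " ", ref( \@selected ), "\n";
```

The last line prints the reference types HASH, CODE and ARRAY, which is what ref() returns for references that are not blessed into a class.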

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a line
    print OUTFILE $sequence;        # write a sequence string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '>f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
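
Passing a typeglob reference works for any function, not only for bioperl constructors. A minimal sketch in core Perl, where write_line is a hypothetical helper and the in-memory file is only used to keep the example self-contained:

```perl
use strict;
use warnings;

# Write a line to whatever filehandle is passed in.
sub write_line {
    my ( $fh, $text ) = @_;
    print {$fh} "$text\n";
}

# A bareword handle such as STDOUT is passed as a reference to its typeglob.
write_line( \*STDOUT, "to standard output" );

# The same sub also accepts a lexical filehandle; here it writes
# into an in-memory string instead of a file on disk.
my $buffer = '';
open( my $mem, '>', \$buffer ) || die "cannot open in-memory file: $!";
write_line( $mem, "to a string" );
close($mem);
```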

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
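
The die and eval mechanism can be tried without bioperl. In this minimal sketch, checked_sqrt and try_sqrt are hypothetical helpers; the first raises an exception with die, the second catches it with eval and inspects $@.

```perl
use strict;
use warnings;

# A sub that raises an exception with die when its argument is invalid.
sub checked_sqrt {
    my ($x) = @_;
    die "negative argument: $x\n" if $x < 0;
    return sqrt($x);
}

# Run the sub inside eval and report whether it raised an exception.
sub try_sqrt {
    my ($x) = @_;
    my $result = eval { checked_sqrt($x) };   # note the ; after the block
    if ($@) {
        return "Caught exception: $@";
    }
    return "no exception, result $result\n";
}

print try_sqrt(16);
print try_sqrt(-1);
```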

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
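
The abstract-method idiom can be reproduced with a few lines of core Perl. The classes below (ShapeI, Square, Blob) are hypothetical and only mirror the pattern: the interface class provides a method that dies, and a subclass that forgets to override it fails at call time.

```perl
use strict;
use warnings;

# Interface class: declares area() but does not implement it.
package ShapeI;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub area {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide the area method\n";
}

# Complete subclass: overrides area().
package Square;
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side}**2 }

# Incomplete subclass: forgets to override area().
package Blob;
our @ISA = ('ShapeI');

package main;

my $area = Square->new( side => 3 )->area;
print "square area: $area\n";

my $ok    = eval { Blob->new->area; 1 };
my $error = $@;
print "caught: $error" unless $ok;
```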

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
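
To experiment with getopts without typing a command line, the arguments can be placed in a local copy of @ARGV. This is a minimal sketch; parse_formats is a hypothetical helper, and the option letters follow the first example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse -i and -o (both take a value) and the -h flag from a list of
# arguments; return the two formats (with defaults) and what is left.
sub parse_formats {
    local @ARGV = @_;                 # getopts works on @ARGV
    my %opts = ();
    getopts( 'hi:o:', \%opts ) or die "bad options\n";
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';
    return ( $inform, $outform, [@ARGV] );
}

my ( $in, $out, $rest ) = parse_formats( '-o', 'gcg', 'seqs.sp' );
print "in=$in out=$out remaining=@$rest\n";
```

A colon after a letter in the specification string means that the option takes a value; getopts removes the options it recognizes from @ARGV and leaves the other arguments in place.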

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if ( Bio::Tools::Blast->isa("Bio::Root::RootI") ) { ... }

    if ( $seq->isa("Bio::Seq::RichSeq") ) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
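
The isa and ref tests behave the same way on any Perl class. A minimal sketch with two hypothetical classes, Feature and Exon, defined inline:

```perl
use strict;
use warnings;

package Feature;
sub new { my ($class) = @_; return bless {}, $class }

package Exon;
our @ISA = ('Feature');       # Exon inherits from Feature

package main;

my $exon = Exon->new;         # new() is inherited from Feature

print "class of exon: ", ref($exon), "\n";
print "an Exon object is-a Feature\n" if $exon->isa('Feature');
print "the Exon class is-a Feature\n" if Exon->isa('Feature');
print "a Feature is not an Exon\n"    unless Feature->isa('Exon');
```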

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
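
The following core Perl sketch shows when a BEGIN block runs relative to ordinary statements; the @order array is only there to record the sequence of events.

```perl
use strict;
use warnings;

our @order;

push @order, 'main code';

# Although it comes later in the file, this block runs first,
# while the file is still being compiled.
BEGIN { push @order, 'BEGIN block' }

print join( ', then ', @order ), "\n";
```

This is why a module can rely on values set in a BEGIN block being available before any of its methods is called.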

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods
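
The three Exporter variables can be tried out with a small module defined inline. This is a sketch under stated assumptions: Greeting, hello, goodbye and $Default are hypothetical names, and the module is declared inside a BEGIN block only so that the example fits in a single file; a real module would live in its own .pm file and be loaded with use.

```perl
use strict;
use warnings;

# A small module defined inline (hypothetical; normally Greeting.pm).
BEGIN {
    package Greeting;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(hello);                    # exported by default
    our @EXPORT_OK   = qw(goodbye $Default);         # imported on demand
    our %EXPORT_TAGS = ( obj => [qw($Default)] );    # a named set of symbols
    our $Default     = 'world';
    sub hello   { return "hello $_[0]" }
    sub goodbye { return "goodbye $_[0]" }
}

# Equivalent of: use Greeting qw(:obj goodbye);
BEGIN { Greeting->import(qw(:obj goodbye)) }

print goodbye($Default), "\n";
```

Because an explicit import list is given, the default symbol hello is not imported here; only goodbye and the symbols tagged obj are.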

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
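
The tie mechanism can be observed in isolation with a tiny tied filehandle. The QueueHandle class below is hypothetical and unrelated to bioperl: reading from the handle takes items from one array, printing to it stores items in another.

```perl
use strict;
use warnings;
use Symbol ();

package QueueHandle;

# Called by tie: remember where to read from and where to write to.
sub TIEHANDLE {
    my ( $class, $in, $out ) = @_;
    return bless { in => $in, out => $out }, $class;
}

# Called by <$fh>: return the next item, or undef when none is left.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} };
}

# Called by print {$fh} ...: store the printed items.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my @source = ( 'seq1', 'seq2' );
my @sink;

my $fh = Symbol::gensym();
tie *$fh, 'QueueHandle', \@source, \@sink;

while ( my $item = <$fh> ) {      # calls READLINE
    print {$fh} uc($item);        # calls PRINT
}
print "written: @sink\n";
```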

Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = '';
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne '';
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
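The gap-trimming logic of this solution can be tried without an alignment file. The following is a minimal sketch in plain Perl, assuming the aligned rows are available as equal-length strings; the function name trim_bounds and the sample rows are hypothetical and are not part of Bioperl.

```perl
use strict;
use warnings;

# Return the 1-based first and last columns to keep, i.e. the window
# in which no row still starts or ends with a run of gap characters.
sub trim_bounds {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

# Hypothetical three-row alignment, 10 columns wide.
my @rows = ('--ACGTAC--',
            '-TACGTACG-',
            'GTACGTAC--');
my ($from, $to) = trim_bounds('-', @rows);
print "keep columns $from to $to\n";
print substr($_, $from - 1, $to - $from + 1), "\n" foreach @rows;
```

The two returned values play the role of the arguments passed to slice() in the solution: the longest leading gap run fixes the first column, the longest trailing gap run fixes the last one.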

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines above)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solutionblast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# databank name = the word before the first '|' in the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
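The "first hit per databank" idea in this solution does not depend on Blast itself. Here is a small sketch in plain Perl, assuming hit names of the form databank|accession|entry that arrive already sorted by significance; the sample names and the helper best_by_databank are hypothetical.

```perl
use strict;
use warnings;

# Extract the databank prefix, i.e. the word before the first '|'.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Keep the first (best) hit name seen for each databank.
sub best_by_databank {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Hypothetical hit names, in the order a Blast report would list them.
my @hits = ('sp|P11111|AAA_ECOLI', 'sp|P22222|BBB_ECOLI',
            'tr|Q33333|Q33333_ECOLI', 'sp|P44444|CCC_ECOLI');
print "$_\n" foreach best_by_databank(@hits);
```

Because a Blast report lists hits from most to least significant, remembering which databank prefixes have been seen is enough; no explicit comparison of scores is needed.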

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = '';
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= '-';
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= '-';
    }
    return $gapped_seq;
}

Example of use:

perl solutionblast_locate.pl data/contig data/blast-est.out
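The padding step performed by fill_gaps can be checked on its own. The sketch below is a plain-Perl variant, written with the repetition operator x instead of loops, that should give the same result as the version above for in-range inputs; the sample values are hypothetical, and the clamp for a sequence running past the end is an added assumption.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# and the resulting row spans $length columns in total.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the last column
    return ('-' x $left) . $seq . ('-' x $right);
}

# Hypothetical HSP string placed at column 3 of a 10-column alignment.
my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
print length($row), "\n";
```

Every row added to a Bio::SimpleAlign object is expected to have the same length as the contig row, which is why both sides are padded.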

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    # only the HSPs of the first subject are recorded as features
    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length,
                         $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length,
                         $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# hits seen in earlier iterations
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# one two-row alignment per HSP
while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::Seq::RichSeq;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;
    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solutionbuild_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solutionmake_index.pl mydb.idx mydb.embl


• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solutionuse_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#   + 1 feature for the whole CDS
#   + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift exon coordinates to the subsequence
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databank name => sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # $entry has the form db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ''; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;


Page 6: Bio Perl

List of Examples

2.1 SwissProt -> Fasta
2.2 Loading a sequence from a remote server
2.3 Find the references to the PDB database entries
3.1 Format conversions with AlignIO
3.2 Basic methods of SimpleAlign
3.3 Filter gap columns
4.1 StandAloneBlast run
4.2 StandAloneBlast parsing
4.3 Parsing from a Blast file
4.4 Parsing with BPLite
4.5 Running PSI-blast
4.6 PSI-Blast SearchIO class
4.7 Running bl2seq
4.8 Genscan parsing
4.9 Genscan parsing with sub-sequences
5.1 Database class use
5.2 Database Index creation

List of Exercises

2.1 Bio::SeqIO
2.2 A universal converter
2.3 Display a sequence in fasta format
2.4 More on annotations
2.5 Transmembrane helices
2.6 extractcds
2.7 extractcds version 2
3.1 Create an alignment without gaps
4.1 Running Blast on a Swissprot entry
4.2 Running Blast: Setting parameters
4.3 Running Blast: Saving output
4.4 Running a Remote Blast
4.5 Display Blast hits
4.6 Class of a Blast report
4.7 Parse a Blast output file
4.8 Parse Blast results on standard input
4.9 Filtering hits by length
4.10 Filtering hits by position
4.11 Display the best hit by databank
4.12 Multiple queries
4.13 Extracting the subject sequence
4.14 Extracting alignments
4.15 Locate EST in a genome
4.16 Record Blast hits as sequence features
4.17 Print information from a BPLite report
4.18 Create a Bio::Tools::BPlite from a file
4.19 Create a Bio::Tools::BPlite from standard input
4.20 Parse a PSI-blast report
4.21 Build a Bio::SimpleAlign object
4.22 Code reading: Bio::Tools::Genscan module
5.1 Parse a Genscan report and build a database entry
5.2 Parse a Genscan report and build a database entry with the genomic sequence
5.3 Build a small bioperl module (for the golden program)

Chapter 1. General introduction

1.1 Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2 General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from input stream

  • write it to output stream

in bioperl:

#! /local/bin/perl -w

use strict;

use Bio::SeqIO;

my $in  = Bio::SeqIO->newFh ( -file => '<seqs.sp',
                              -format => 'swiss' );

my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta',
                              -format => 'fasta' );

print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

use Bio::DB::SwissProt;

my $database = new Bio::DB::SwissProt;
my $seq = $database->get_Seq_by_id('BACR_HALHA');

print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo:

use Bio::Seq;

my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                           -id   => 'my seq',
                           -desc => 'this is a new Seq');

from a file:

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta',
                              -format => 'fasta');
my $seq3 = $seqin->next_seq();

my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta',
                                 -format => 'fasta');
my $seq4 = <$seqin2>;

my $seqin3 = Bio::SeqIO->newFh ( -file => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

use Bio::DB::SwissProt;

my $banque = new Bio::DB::SwissProt;
my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

use Bio::Index::Swissprot;

my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are shown in more detail in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}
print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to: work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip: Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary solution [codesuseSeq10pl] of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table: official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip: Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip: Figure 2.6 shows the relation of features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (feature key) is a primary tag and codon a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
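As a rough mental model, a feature with its tags behaves like a primary tag plus a table that maps each secondary tag to a list of values. The sketch below is plain Perl and does not use Bioperl; the hash layout and the helpers add_value and values_for are hypothetical names chosen for illustration, not the Bio::SeqFeature::Generic interface.

```perl
use strict;
use warnings;

# A feature modelled as a primary tag, a range and a tag => [values] table.
my %feature = (
    primary_tag => 'CDS',
    start       => 1,
    end         => 1000,
    tags        => {},
);

# Append a value to a (possibly new) secondary tag.
sub add_value {
    my ($feat, $tag, $value) = @_;
    push @{ $feat->{tags}{$tag} }, $value;
}

# Return all values recorded for a secondary tag (empty list if none).
sub values_for {
    my ($feat, $tag) = @_;
    return @{ $feat->{tags}{$tag} || [] };
}

add_value(\%feature, 'codon', '(seq:"cug",aa:Ser)');
add_value(\%feature, 'codon', '(seq:"tga",aa:Trp)');
add_value(\%feature, 'source', 'genscan');    # a new, user-defined qualifier

foreach my $tag (sort keys %{ $feature{tags} }) {
    print "/$tag=$_\n" foreach values_for(\%feature, $tag);
}
```

A tag can hold several values, as the two codon qualifiers show, and adding a tag that did not exist before is what makes user-defined qualifier types possible.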

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

234 Location

Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment

Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation

15

Chapter 2 Sequences

[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]

my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)

which specifies a location whose begin is before 30 and whose end is at 90 Coding

bull rsquorsquo EXACT

bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177

bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive

bull rsquoltrsquo BEFORE

bull rsquogtrsquo AFTER
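The coding above can be sketched as a small classifier in plain Perl. This is only an illustration under stated assumptions: classify_position is a hypothetical helper, not a Bioperl function, and it handles single position strings only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): classify one position string
# according to the coding listed above and return (type, min, max).
sub classify_position {
    my ($pos) = @_;
    return ( 'BEFORE',  $1, $1 ) if $pos =~ /^<(\d+)$/;
    return ( 'AFTER',   $1, $1 ) if $pos =~ /^>(\d+)$/;
    return ( 'BETWEEN', $1, $2 ) if $pos =~ /^(\d+)\^(\d+)$/;
    return ( 'WITHIN',  $1, $2 ) if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    return ( 'EXACT',   $1, $1 ) if $pos =~ /^(\d+)$/;
    die "unrecognized position: $pos\n";
}

for my $pos ( '<30', '90', '123^124', '(102.110)', '>200' ) {
    my ( $type, $min, $max ) = classify_position($pos);
    print "$pos\t$type\t$min..$max\n";
}
```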

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy
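As a rough illustration of what a coordinate policy decides, the sketch below reduces a fuzzy range to a single start and end in plain Perl. The %loc values and the %policy table are assumptions made for the example; in Bioperl these choices live in the separate policy classes listed above.

```perl
use strict;
use warnings;

# A fuzzy range described by its extreme coordinates (illustrative values).
my %loc = ( min_start => 102, max_start => 110, min_end => 145, max_end => 177 );

# Hypothetical policy table: the widest policy takes the outermost
# coordinates, the narrowest policy takes the innermost ones.
my %policy = (
    widest    => sub { my ($l) = @_; return ( $l->{min_start}, $l->{max_end} ) },
    narrowest => sub { my ($l) = @_; return ( $l->{max_start}, $l->{min_end} ) },
);

for my $name ( sort keys %policy ) {
    my ( $start, $end ) = $policy{$name}->( \%loc );
    print "$name: $start..$end\n";
}
```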

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;                # sequence as string
    $seqobj->subseq(10,40);      # sub-sequence as string
    $seqobj->trunc(10,100);      # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave(
        '-file'      => 'sigtest.aa',    # not a Bio::Seq object
        '-threshold' => '3.5',
        '-desc'      => 'test sigcleave protein seq',
        '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split(//, $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split //, $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
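The column extraction and reconstruction used in Example 3.3 can be tried without Bioperl on plain strings. The following is a minimal sketch under that assumption: drop_gap_columns is a hypothetical helper, and the three aligned strings are made-up data.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a gap character and equal-length aligned
# strings, returns the strings with every gap-containing column removed.
sub drop_gap_columns {
    my ( $gapchar, @rows ) = @_;

    # Transpose rows into column strings.
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[ $i++ ] .= $_ for split //, $row;
    }

    # Keep only the columns without any gap.
    my @kept = grep { index( $_, $gapchar ) < 0 } @cols;

    # Transpose the kept columns back into rows.
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[ $i++ ] .= $_ for split //, $col;
    }
    return @out;
}

my @filtered = drop_gap_columns( '-', 'AC-GT', 'A-TGT', 'ACTG-' );
print "$_\n" for @filtered;
```

Only the first and fourth columns are gap-free in this data, so each row shrinks to two characters.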

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname($hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
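The "already done" hash from the hint can be sketched on plain strings, without a Blast report. Everything specific here is an assumption for illustration: the hit names, the 'db|accession' naming, and the split used in place of the dbname function are not taken from the course solution.

```perl
use strict;
use warnings;

# Assumed input: hit names already sorted best-first, as in a Blast report.
# The 'db|accession' naming is an assumption made for this sketch.
my @hit_names = ( 'sp|P12345', 'sp|Q67890', 'pir|A00001', 'sp|P11111', 'pir|B00002' );

my %done;         # databases already seen
my @best_hits;    # first (best) hit of each database
for my $name (@hit_names) {
    my ($database) = split /\|/, $name;
    next if $done{$database};      # a better hit was already kept
    $done{$database} = 1;
    push @best_hits, $name;
}
print "$_\n" for @best_hits;
```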

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );

    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

  with a variable

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
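The reminders above can be run as a single plain-Perl script. This is a minimal sketch: run_with and the parameter names are hypothetical, chosen only to show a hash reference and a code reference being passed to a function.

```perl
use strict;
use warnings;

# Hypothetical function: receives a hash reference and a code reference,
# and returns the parameter names accepted by the filter.
sub run_with {
    my ( $params, $filter ) = @_;
    return grep { $filter->($_) } sort keys %$params;
}

my %run_param = ( -prog => 'blastp', -database => 'swissprot' );
my @selected  = run_with( \%run_param, sub { $_[0] =~ /prog/ } );

my $rh = { a => 1, b => 2 };    # reference to an anonymous hash
my $rl = [ 1, 2 ];              # reference to an anonymous array

print "selected: @selected\n";
print ref($rh), " $rh->{a} $rh->{b}\n";
print ref($rl), " $rl->[0] $rl->[1]\n";
```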

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors), which in perl are symbols without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
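Passing a typeglob reference can be tried in isolation with the following plain-Perl sketch. write_line is a hypothetical helper, and the in-memory file (available in perl 5.8 and later, so newer than the perl 5.6 paths shown elsewhere in this course) is only there to make the result easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it is given.
sub write_line {
    my ( $fh, $text ) = @_;
    print {$fh} "$text\n";
}

# A bareword filehandle has to be passed as a reference to its typeglob.
write_line( \*STDOUT, 'to standard output' );

# An in-memory file stands in for a real output file.
my $buffer = '';
open( my $out, '>', \$buffer ) or die "cannot open in-memory file: $!";
write_line( $out, 'to a lexical filehandle' );
close $out;
print $buffer;
```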

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
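The eval / $@ pattern can be exercised without Bioperl. In this minimal sketch, risky_length is a hypothetical function standing in for a Bioperl call that may throw; the healthy-sequence check is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical function standing in for a Bioperl call that may throw.
sub risky_length {
    my ($seq) = @_;
    die "sequence [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return length $seq;
}

my @log;
for my $seq ( 'ACGT', '==' ) {
    # eval returns undef and sets $@ when the block dies.
    my $len = eval { risky_length($seq) };
    if ($@) {
        push @log, "caught: $@";
    }
    else {
        push @log, "ok: $len\n";
    }
}
print @log;
```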

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
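The abstract-method idiom can be reproduced in a few lines of plain Perl. The class names below (My::ShapeI, My::Square, My::Blob) are hypothetical and unrelated to Bioperl; the sketch only shows that a subclass which forgets the method gets the interface's exception.

```perl
use strict;
use warnings;

package My::ShapeI;    # hypothetical interface class
sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub area { my ($self) = @_; die ref($self) . " did not implement area\n" }

package My::Square;    # implements the interface
our @ISA = ('My::ShapeI');
sub area { my ($self) = @_; return $self->{side}**2 }

package My::Blob;      # forgets to implement area()
our @ISA = ('My::ShapeI');

package main;

my $square_area = My::Square->new( side => 3 )->area;
my $blob_area   = eval { My::Blob->new->area };    # falls back to My::ShapeI::area
my $error       = $@;

print "square: $square_area\n";
print "blob: $error";
```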

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
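To see what getopts leaves behind, the first example can be run on a simulated command line. This is a sketch under stated assumptions: the @ARGV contents are made up, and a real script would of course use the arguments it was called with.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line for this sketch; a real script uses @ARGV as is.
@ARGV = ( '-i', 'embl', '-I', 'seqs.embl', 'extra_arg' );

my %opts;
getopts( 'i:o:I:O:', \%opts )
    or die "usage: $0 [-i format] [-o format] [-I file] [-O file]\n";

# Options that were not given fall back to the defaults.
my $infile  = $opts{I} || 'infile';
my $outfile = $opts{O} || 'outfile';
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';

print "$inform:$infile -> $outform:$outfile (remaining: @ARGV)\n";
```

The parsed options are removed from @ARGV, so only the non-option arguments remain for the rest of the script.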

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
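The is-a tests and ref() can be checked on two tiny classes. This is a minimal sketch: My::Root and My::Seq are hypothetical names used only to show a class-level test, an object-level test, and the class lookup.

```perl
use strict;
use warnings;

package My::Root;    # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package My::Seq;     # hypothetical subclass, inherits new() through @ISA
our @ISA = ('My::Root');

package main;

my $seq = My::Seq->new;

print "class of seq: ", ref($seq), "\n";
print "My::Seq is-a My::Root\n" if My::Seq->isa('My::Root');    # class test
print "seq is-a My::Root\n"     if $seq->isa('My::Root');       # object test
```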

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
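A short plain-Perl sketch shows when a BEGIN block runs relative to ordinary statements. The @order array is an assumption made for the example; it simply records the order of execution.

```perl
use strict;
use warnings;

our @order;

# This statement appears first in the file but runs at run time.
push @order, 'run time statement';

# This block appears later but runs as soon as it has been compiled.
BEGIN { push @order, 'BEGIN block' }

print join( ', ', @order ), "\n";
```

The BEGIN entry ends up first in the list, which is why a module can rely on a BEGIN block to set a default before any of its methods is called.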

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods
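The three Exporter variables can be tried in a single file. This is a sketch under stated assumptions: My::Tools, revcom, gc_count and $Name are hypothetical, and the module is inlined in a BEGIN block only so that the example is self-contained; normally it would live in its own My/Tools.pm file.

```perl
use strict;
use warnings;

# Hypothetical module, inlined so that the sketch is a single file.
BEGIN {
    package My::Tools;
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcom');                 # exported by default
    our @EXPORT_OK   = ( 'gc_count', '$Name' );    # imported on demand
    our %EXPORT_TAGS = ( obj => ['$Name'] );       # a named set of symbols
    our $Name        = 'tools';

    sub revcom {
        my $s = reverse shift;
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }
    sub gc_count { return scalar( () = $_[0] =~ /[GCgc]/g ) }

    $INC{'My/Tools.pm'} = __FILE__;                # mark the module as loaded
}

use My::Tools;                      # imports revcom only
use My::Tools qw(gc_count :obj);    # imports gc_count and $Name

print revcom('AACG'), ' ', gc_count('AACG'), " $Name\n";
```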

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
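The mechanism can be reproduced with a tiny tied filehandle in plain Perl. My::ListFh is a hypothetical class, not Bio::SeqIO: it reads from an in-memory list and stores whatever is printed, which is enough to see READLINE and PRINT being called.

```perl
use strict;
use warnings;
use Symbol ();

package My::ListFh;    # hypothetical class, not Bio::SeqIO

sub TIEHANDLE {
    my ( $class, @items ) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub READLINE { my ($self) = @_; return shift @{ $self->{in} } }
sub PRINT    { my ( $self, @args ) = @_; push @{ $self->{out} }, @args; return 1 }

package main;

my $fh  = Symbol::gensym();                           # anonymous glob, as in fh()
my $obj = tie *$fh, 'My::ListFh', 'seq1', 'seq2';     # tie returns the object

my @read;
while ( defined( my $item = <$fh> ) ) {    # calls READLINE
    push @read, $item;
    print {$fh} uc($item);                 # calls PRINT
}
print STDOUT "read: @read; written: @{ $obj->{out} }\n";
```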

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp
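The derivation of the default output file name can be isolated in a small function. This is a sketch, not the course's code: default_outfile is a hypothetical helper, and unlike the substitution used in the solution it removes only the last extension and leaves dots in directory names alone.

```perl
use strict;
use warnings;

# Build "<input name without its last extension>.<output format>".
sub default_outfile {
    my ($infile, $outform) = @_;
    (my $base = $infile) =~ s/\.[^.\/]*$//;   # drop the last extension, if any
    return "$base.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";
print default_outfile('data.dir/seqs', 'fasta'), "\n";
```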

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta
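The option handling can be tested separately from the conversion itself. The sketch below wraps getopts() in a hypothetical parse_options function that receives the argument list explicitly, so that the defaults are easy to check; the function name and the returned hash layout are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse -i/-o (formats) and -I/-O (file names), applying the same defaults
# as the script above.
sub parse_options {
    local @ARGV = @_;            # getopts() works on @ARGV
    my %opts;
    getopts('i:o:I:O:', \%opts)
        or die "usage: [-i informat] [-o outformat] [-I infile] [-O outfile]\n";
    return (
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
    );
}

my %conf = parse_options(qw(-I seqs.sp -o gcg));
print "$_ = $conf{$_}\n" for sort keys %conf;
```

Each letter followed by a colon in the getopts() specification expects a value; options that are not given fall back to the defaults.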

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() to avoid a warning on undefined values
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";
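The text processing at the heart of this solution does not depend on bioperl and can be tried on a plain string. In this sketch, extract_function is a hypothetical helper and the comment text is an invented sample laid out like SwissProt comment topics.

```perl
use strict;
use warnings;

# Return the text of the first FUNCTION topic, or '' if there is none.
sub extract_function {
    my ($text) = @_;
    foreach my $line (split /\n/, $text) {
        return $line if $line =~ s/^-!- FUNCTION:\s*//;
    }
    return '';
}

# invented sample, not taken from a real entry
my $comment_text = join "\n",
    '-!- SUBUNIT: Homodimer.',
    '-!- FUNCTION: Light-driven proton pump.',
    '-!- SUBCELLULAR LOCATION: Membrane.';

print "Function: ", extract_function($comment_text), "\n";
```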

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        # use the alignment's gap character here too, not a hard-coded '-'
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
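The gap computation can be checked on plain strings before applying it to a real alignment. The sketch below is an illustration only: gap_margins and trim_rows are hypothetical helpers, and the three rows are made-up aligned sequences of equal length.

```perl
use strict;
use warnings;

# Longest run of leading and of trailing gap characters over all rows.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($lead, $trail) = (0, 0);
    for my $row (@rows) {
        my ($l) = $row =~ /^((?:\Q$gap_char\E)*)/;
        my ($t) = $row =~ /((?:\Q$gap_char\E)*)$/;
        $lead  = length($l) if length($l) > $lead;
        $trail = length($t) if length($t) > $trail;
    }
    return ($lead, $trail);
}

# Cut the same number of columns from every row (string analogue of slice()).
sub trim_rows {
    my ($gap_char, @rows) = @_;
    my ($lead, $trail) = gap_margins($gap_char, @rows);
    return map { substr($_, $lead, length($_) - $lead - $trail) } @rows;
}

my @rows = ('--ACGTAC--', '-TACGTACG-', 'GTACGTAC--');
print "$_\n" for trim_rows('-', @rows);
```

Because the margins are the maxima over all rows, the trimmed block starts and ends with columns in which no sequence has a terminal gap.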

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use these lines instead of the ones under a)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
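The position test, including the convention that -1 means "no upper bound", can be written as a small predicate and checked on made-up coordinates. in_window is a hypothetical helper; the [start, end] pairs below are invented and stand for HSP positions.

```perl
use strict;
use warnings;

# True if the segment starts at or after $min_start and, when an upper
# bound is given, ends at or before $max_end (-1 disables the bound).
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    $min_start = 1  unless defined $min_start;
    $max_end   = -1 unless defined $max_end;
    return ($start >= $min_start
            && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

my @hsps = ([5, 80], [120, 300], [40, 150]);        # invented [start, end] pairs
my @kept = grep { in_window(@$_, 30, 200) } @hsps;

print "kept: ", join(' ', map { "$_->[0]..$_->[1]" } @kept), "\n";
```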

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix of a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;    # databanks for which a hit has already been displayed

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
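The selection logic relies on the hits being listed best first: the first hit seen for each databank is then the best one. The sketch below applies the same idea to a list of names; best_hit_per_db is a hypothetical helper and the hit names are invented, only shaped like db|accession|name identifiers.

```perl
use strict;
use warnings;

# Databank prefix of a name such as "sp|P00001|AAAA_TEST" ('' if none).
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Keep the first name seen for each databank (input assumed sorted best first).
sub best_hit_per_db {
    my @names = @_;
    my (%done, @best);
    for my $name (@names) {
        next if $done{ dbname($name) }++;
        push @best, $name;
    }
    return @best;
}

my @hits = ('sp|P00001|AAAA_TEST', 'sp|P00002|BBBB_TEST',
            'tr|Q00001|CCCC_TEST', 'tr|Q00002|DDDD_TEST');   # invented names
print "$_\n" for best_hit_per_db(@hits);
```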

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
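The padding step is easy to verify on its own. The following sketch is a compact variant of fill_gaps that uses the repetition operator instead of two loops; the sequence and coordinates in the example are made up.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;           # sequence reaches the last column
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACG', 3, 8), "\n";
```

For a three-letter sequence placed at column 3 of an eight-column alignment, two gaps are added in front and three behind, so every padded sequence has the width of the contig.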

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in the previous iterations

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
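The comparison between two iterations amounts to two set differences. The sketch below shows this on plain lists of names; compare_rounds is a hypothetical helper, the hit names are placeholders, and exact hash lookups are used here instead of the pattern match of the solution.

```perl
use strict;
use warnings;

# Given the hit names of the previous and of the current iteration,
# return the names that are new and the names that have disappeared.
sub compare_rounds {
    my ($previous, $current) = @_;      # two array references
    my %in_prev = map { $_ => 1 } @$previous;
    my %in_cur  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_prev{$_} } @$current;
    my @disappeared = grep { !$in_cur{$_} }  @$previous;
    return (\@new, \@disappeared);
}

my ($new, $gone) = compare_rounds([qw(hitA hitB hitC)], [qw(hitB hitC hitD)]);
print "new: @$new\n";
print "disappeared: @$gone\n";
```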

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

  You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

  You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

  You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

  since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #  - 1 entry per gene with genomic DNA
    #  + 1 feature for the whole CDS
    #  + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # make the exon coordinates relative to the start of the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end   - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry:
        # subsequence of the sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
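The coordinate arithmetic of this solution (a window of delta bases around the gene, and exon positions made relative to that window) can be checked with plain numbers. The sketch below is a variant written under its own assumptions, not a transcription of the script: window is a hypothetical helper, coordinates are 1-based, exons are given in ascending order, and the exon positions are made up.

```perl
use strict;
use warnings;

# Return the window [$start, $end] around the exons, clamped to the
# sequence, and the exon coordinates relative to that window.
sub window {
    my ($exons, $delta, $seq_length) = @_;    # $exons: [[start, end], ...]
    my $start = $exons->[0][0] - $delta;
    $start = 1 if $start < 1;
    my $end = $exons->[-1][1] + $delta;
    $end = $seq_length if $end > $seq_length;
    my $offset   = $start - 1;                # shift applied to every position
    my @relative = map { [ $_->[0] - $offset, $_->[1] - $offset ] } @$exons;
    return ($start, $end, \@relative);
}

my ($start, $end, $rel) = window([[150, 200], [300, 420]], 100, 450);
print "window: $start..$end\n";
print "exons:  ", join(', ', map { "$_->[0]..$_->[1]" } @$rel), "\n";
```

In this variant the end of the window is computed from the original exon coordinates, before any shift is applied, which keeps the window and the relative positions consistent with each other.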

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # closing the pipe sets $? to the exit status of golden
        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;
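The argument checking of get_Seq can be exercised without golden or bioperl. In this sketch, parse_entry is a hypothetical helper that reproduces only the db:id splitting and the format lookup; errors are raised with die instead of the throw method, and %FORMAT_OF copies the table of the module above.

```perl
use strict;
use warnings;

my %FORMAT_OF = ( sprot => 'swiss', sptrnrdb => 'swiss',
                  gb    => 'genbank', embl   => 'embl' );

# Split "db:id" and return (db, id, sequence format), or die with a message.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $FORMAT_OF{$db};
    return ($db, $id, $FORMAT_OF{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "database $db, entry $id, format $format\n";
```

Checking the entry specification before the external program is started gives a clearer message than a failing pipe would.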

Page 7: Bio Perl

List of Exercises

2.1 Bio::SeqIO
2.2 A universal converter
2.3 Display a sequence in fasta format
2.4 More on annotations
2.5 Transmembrane helices
2.6 extractcds
2.7 extractcds version 2
3.1 Create an alignment without gaps
4.1 Running Blast on a Swissprot entry
4.2 Running Blast: Setting parameters
4.3 Running Blast: Saving output
4.4 Running a Remote Blast
4.5 Display Blast hits
4.6 Class of a Blast report
4.7 Parse a Blast output file
4.8 Parse Blast results on standard input
4.9 Filtering hits by length
4.10 Filtering hits by position
4.11 Display the best hit by databank
4.12 Multiple queries
4.13 Extracting the subject sequence
4.14 Extracting alignments
4.15 Locate EST in a genome
4.16 Record Blast hits as sequence features
4.17 Print information from a BPLite report
4.18 Create a Bio::Tools::BPlite from a file
4.19 Create a Bio::Tools::BPlite from standard input
4.20 Parse a PSI-blast report
4.21 Build a Bio::SimpleAlign object
4.22 Code reading: Bio::Tools::Genscan module
5.1 Parse a Genscan report and build a database entry
5.2 Parse a Genscan report and build a database entry with the genomic sequence
5.3 Build a small bioperl module (for the golden program)

Chapter 1 General introduction

1.1 Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live/]

5. Unix help: man and perldoc

1.2 General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2 Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1 SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from input stream

  • write it to output stream

in bioperl:

    #! /local/bin/perl -w

    use strict;

    use Bio::SeqIO;

    my $in  = Bio::SeqIO->newFh ( -file   => '<seqs.sp',
                                  -format => 'swiss' );

    my $out = Bio::SeqIO->newFh ( -file   => '>seqs.fasta',
                                  -format => 'fasta' );

    print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1 Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2 A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2 Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3 Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo:

    use Bio::Seq;

    my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                               -id   => 'my seq',
                               -desc => 'this is a new Seq');

from a file:

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new ( -file   => 'seq.fasta',
                                  -format => 'fasta');
    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh ( -file   => 'seq.fasta',
                                     -format => 'fasta');
    my $seq4 = <$seqin2>;

    my $seqin3 = Bio::SeqIO->newFh ( -file   => 'golden sp:TAUD_ECOLI |',
                                     -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1 Bio::Seq class structure

Figure 2.2 Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section

224 Bioperl rsquoSequencersquo classes structure

10

Chapter 2 Sequences

Figure 24 Bioperl rsquoSequencersquo classes structure

23 Features and Location classes

231 Feature

11

Chapter 2 Sequences

They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)

Documentation on features:

• Features table: official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure


Figure 2.5. Features classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_codeextractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2gb [dataseq2gb]).


Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_codeextractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.


Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations: you can use fuzzy locations, ranges or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.

• '<' BEFORE

• '>' AFTER
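
The coding above can be illustrated without bioperl. The following is only a sketch in plain Perl: classify_position is a hypothetical helper (it is not a bioperl method), and it assumes that a position is written the way the examples in the list write it.

```perl
use strict;
use warnings;

# Classify one position of a feature-table location string.
# Hypothetical helper for illustration; not part of bioperl.
sub classify_position {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;             # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;             # >177
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;         # 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;   # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;              # 90
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    printf "%-10s %s\n", $pos, classify_position($pos);
}
```

In bioperl itself this work is done by the Bio::Location classes listed below; the sketch only shows how the five symbols map to the five position types.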

Location classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy


Figure 2.7. Location classes structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.


Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codesfeature_graphpl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat     = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');
    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;


Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [datainfilealn])

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;


AlignIO class structure

Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();


SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
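
The same three steps (build the columns, filter them, rebuild the rows) can be tried on plain strings, without bioperl. This is only a sketch: the toy alignment and the variable names are assumptions made for the illustration, and '-' is assumed to be the gap character.

```perl
use strict;
use warnings;

# Toy alignment: three rows of equal length, '-' is the gap character.
my @rows     = ('AC-GT', 'A-CGT', 'ACCGT');
my $gap_char = '-';

# 1. Build one string per column.
my @cols;
for my $row (@rows) {
    my $colnr = 0;
    $cols[$colnr++] .= $_ for split //, $row;
}

# 2. Keep only the columns that contain no gap.
my @no_gap_cols = grep { index($_, $gap_char) < 0 } @cols;

# 3. Rebuild the rows from the remaining columns.
my @filtered = ('') x @rows;
for my $col (@no_gap_cols) {
    my $rownr = 0;
    $filtered[$rownr++] .= $_ for split //, $col;
}

print "$_\n" for @filtered;
```

Only the first, fourth and fifth columns are gap free here, so each rebuilt row keeps three characters.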

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [datainfilealn]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_codeprotal2dnahtml] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data1protfasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data1protfasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9


Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics


Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast classes diagram


Figure 4.1. Blast classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [datablastout]).

Solution A.14


Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout]).

• If necessary, you can find here a dbname(hit->name) [solutiondbnamepl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [dataP42704fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [dataest] in this contig [datacontig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [datablast-estout]

Solution A.20


Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPlite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17. Print information from a BPlite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24


Figure 4.2. BPlite classes diagram


4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;

    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [datapsiblastout]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length
                       );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: internal classes structure


The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram


4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [datagenscanout]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [datagenscan-cdsout])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }


Figure 4.4. Genscan classes structure


Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?


Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database index creation

(data: a SwissProt file to index [datasmall_swissdat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);


Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [datagenscan-arabout]):

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as features in the current entry

• (b) index this EMBL formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27


Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of the chromosome V [dataarabidop-chr5-1quartfasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:


    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29


Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to the following, with a named array variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
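
The first bullet (passing a hash and a function by reference) can be tried with a small self-contained script. This is only a sketch: keep_hits, the hit names and the scores are made up for the illustration and have nothing to do with the Bio::Tools::Blast interface.

```perl
use strict;
use warnings;

# Hypothetical filter: keep the hits accepted by a caller-supplied function.
sub keep_hits {
    my ($hits, $filter) = @_;      # a hash reference and a code reference
    my @names = grep { $filter->($hits->{$_}) } sort keys %$hits;
    return @names;
}

my %scores = (hitA => 120, hitB => 35, hitC => 80);

# \%scores passes the hash by reference; sub { ... } is an anonymous function.
my @kept = keep_hits(\%scores, sub { $_[0] >= 50 });
print "kept: @kept\n";
```

A named function would be passed as \&my_filter, exactly as in the -filt_func parameter above.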

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a line
    print OUTFILE $sequence;        # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
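
As a minimal sketch of the typeglob reference, the script below passes \*STDOUT to a function of its own. write_record is a hypothetical helper, not a bioperl function, and the in-memory file is only there to show that the same function also accepts a lexical filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: write one FASTA-like record to any filehandle.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
}

# A bareword filehandle is passed as a reference to its typeglob.
write_record(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle (here an in-memory file) is passed directly.
open(my $mem, '>', \my $buffer) or die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);
print $buffer;
```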

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
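
Both ideas (an abstract method that dies, and an eval block that catches the exception) can be combined in a few lines of plain Perl. The classes My::SeqI, My::GoodSeq and My::BadSeq are hypothetical and only mimic the pattern; bioperl itself goes through throw, as shown above.

```perl
use strict;
use warnings;

package My::SeqI;                    # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub seq {                            # abstract method: subclasses must override it
    my ($self) = @_;
    die ref($self) . " did not provide a seq method\n";
}

package My::GoodSeq;                 # implements seq
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq} }

package My::BadSeq;                  # forgets to implement seq
our @ISA = ('My::SeqI');

package main;

my @report;
for my $obj (My::GoodSeq->new(seq => 'ACGT'), My::BadSeq->new) {
    my $seq = eval { $obj->seq };    # the eval statement ends with ';'
    push @report, $@ ? "caught: $@" : "seq: $seq\n";
}
print @report;
```

The call on the My::BadSeq object falls back to the method inherited from the interface class, which raises the exception; the script keeps running because the call is wrapped in eval.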


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }

625 Classes

bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]

ISA = qw(BioRootRootI BioSeqI)

bull In bioperl you use classes the following ways

bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )

$seq_stats = BioToolsSeqStats-gtnew ($seqobj)

bull As a library component

55

Chapter 6 Perl Reminders

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI"))

if ($seq->isa("Bio::Seq::RichSeq"))

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
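The same tests can be run on a toy hierarchy. Feature and Exon below are hypothetical classes defined only for this illustration; isa and ref are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical class hierarchy for illustration: Exon is-a Feature.
package Feature;
sub new { my ($class) = @_; return bless {}, $class; }

package Exon;
our @ISA = ('Feature');

package main;

my $exon = Exon->new();

print "Exon class is-a Feature\n"    if Exon->isa('Feature');
print "\$exon object is-a Feature\n" if $exon->isa('Feature');
print "class of exon: ", ref($exon), "\n";
```

isa can be called on a class name or on an object; ref returns the name of the class the object was blessed into.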

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
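A minimal sketch of the execution order, using only core Perl. The variable names (@trace, $default_policy) are made up for this illustration.

```perl
use strict;
use warnings;

our @trace;          # records the order in which statements run
our $default_policy; # hypothetical package-wide default

push @trace, 'main code';

# Runs at compile time, before the ordinary statement above.
BEGIN {
    $default_policy = 'widest';
    push @trace, 'BEGIN block';
}

print join(' -> ', @trace), "\n";
print "default policy: $default_policy\n";
```

Although the BEGIN block is written after the first push, it runs first, so the default is already set when any ordinary code (or any method) needs it.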

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles module variables scope (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for symbol sets; example:

use Bio::Tools::Blast qw(:obj);

imports symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
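The three Exporter variables can be seen at work in a one-file sketch. MyTools, hello, goodbye and $Greeter are hypothetical names for this illustration; because the module is defined inline, import is called directly instead of through use.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the example is one file.
{
    package MyTools;
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ('hello');                 # exported by default
    our @EXPORT_OK   = ('goodbye', '$Greeter');   # imported on demand
    our %EXPORT_TAGS = (obj => ['$Greeter']);     # named set of symbols
    our $Greeter     = 'static greeter object';

    sub hello   { return 'hello' }
    sub goodbye { return 'goodbye' }
}

# With MyTools in its own file this would be: use MyTools qw(:DEFAULT :obj);
our $Greeter;
MyTools->import(':DEFAULT', ':obj');

print hello(), "\n";
print "$Greeter\n";
print "goodbye not imported\n" unless defined &main::goodbye;
```

hello arrives through @EXPORT, $Greeter through the obj tag, and goodbye stays out because it was never requested.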

6.3.2 Compiler instructions

• use strict;: no symbolic references, no use of undeclared variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables.
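A short sketch of both pragmas together. The content of %valid_type below is only an illustration, and the string eval is used so that the refused statement does not stop the whole script.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);   # pre-declared package (global) variables

# Illustrative content only.
%valid_type = map { $_ => 1 } qw(dna rna protein);

# Lexical (local) variable declared with my.
my $type = 'dna';
print "$type is valid\n" if $valid_type{$type};

# Under strict, an undeclared variable is a compile-time error.
my $ok = eval q{ $undeclared = 1; 1 };
print "strict refused the undeclared variable\n" unless $ok;
```

%valid_type can be used without my because use vars declared it; $undeclared cannot.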

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in (Bio::SeqIO):

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class has thus to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;        # make a tied filehandle
$sequence = <$fh>;     # read a sequence object (calls READLINE)
print $fh $sequence;   # write a sequence object (calls PRINT)
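The mechanism can be reproduced with core Perl only. In this sketch ListStream and ListStream::Fh are hypothetical classes standing in for a sequence stream; they read items from a list and collect what is written, so no bioperl installation is assumed.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical stream class: reads items from a list, collects written ones.
package ListStream;

sub new {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub next_item  { my $self = shift; return shift @{ $self->{in} }; }
sub write_item { my $self = shift; push @{ $self->{out} }, @_; return 1; }

# Returns a filehandle tied to a small wrapper around $self.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, 'ListStream::Fh', $self;
    return $s;
}

package ListStream::Fh;

sub TIEHANDLE { my ($class, $stream) = @_; return bless { stream => $stream }, $class; }

sub READLINE {
    my $self = shift;
    return $self->{stream}->next_item() unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_item());
    return @list;
}

sub PRINT { my $self = shift; return $self->{stream}->write_item(@_); }

package main;

my $stream = ListStream->new('seq1', 'seq2');
my $fh = $stream->fh;
my $first = <$fh>;            # calls READLINE
print $fh $first;             # calls PRINT
print "read: $first, written: @{ $stream->{out} }\n";
```

In scalar context <$fh> returns one item; in list context READLINE is called with wantarray true and returns all remaining items.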


Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print " -i informat -- infile format (default 'swiss')\n";
    print " -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print " -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function -- exercise
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
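The gap-counting step does not depend on bioperl and can be tried on plain strings. In this sketch gap_margins is a hypothetical helper and the three rows are made-up data; substr plays the role of the slice method.

```perl
use strict;
use warnings;

# Returns the largest number of leading and of trailing gap characters
# found in any of the aligned strings.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

my @rows = ('--ACGTAC-', 'TTACGTACG', '-TACGTA--');
my ($s, $e) = gap_margins('-', @rows);
# Keep only the columns where every row has started and none has ended.
print substr($_, $s, length($_) - $s - $e), "\n" for @rows;
```

\Q...\E quotes the gap character, so the same code works if the gap character is one that has a special meaning in a regular expression, such as a dot.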

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq => $seq->seq,
                                        -id => $seq->display_id,
                                        -start => 1,
                                        -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq => $seq,
                                               -id => $name,
                                               -start => $start,
                                               -end => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
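The padding done by fill_gaps can be checked on its own, without a Blast report. The variant below is a sketch that uses the x repetition operator instead of the two loops; it is meant to give the same result for a sequence that fits inside the alignment, and the test sequence is made up.

```perl
use strict;
use warnings;

# Pads $seq with '-' so that it starts at column $start (1-based)
# and the result is $length characters long when the sequence fits.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

Here the sequence is placed at columns 3 to 6 of a 10-column alignment, with gaps before and after it.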

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq => $hsp->qs,
                                          -id => $query1->display_id,
                                          -start => 1,
                                          -end => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq => $hsp->ss,
                                          -id => $report->sbjctName,
                                          -start => 1,
                                          -end => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1; # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern: GENSCAN_predicted_CDS_xxx

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence submitted to genscan +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot => 'swiss',
                       sptrnrdb => 'swiss',
                       gb => 'genbank',
                       embl => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

Page 8: Bio Perl

Chapter 1. General introduction

1.1 Bioperl documentation

1. Bioperl 1.0 tutorial [http://bioperl.org/Core/bptutorial.html]

2. Wiki Bioperl documentation [http://bioperl.org/wiki/html/BioPerl/FrontPage.html]

3. Modules listing [http://doc.bioperl.org/releases/bioperl-1.0.1/]

4. http://doc.bioperl.org/bioperl-live/ [http://doc.bioperl.org/bioperl-live]

5. Unix help: man and perldoc

1.2 General bioperl classes

General bioperl classes [http://www.bioperl.org/images/bioperl.pdf] diagram (cf. Figure 6.1)

Chapter 2. Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from input stream

  • write it to output stream

in bioperl:

#! /local/bin/perl -w

use strict;

use Bio::SeqIO;

my $in = Bio::SeqIO->newFh ( -file => '<seqs.sp',
                             -format => 'swiss' );

my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta',
                              -format => 'fasta' );

print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

use Bio::DB::SwissProt;

my $database = new Bio::DB::SwissProt;
my $seq = $database->get_Seq_by_id('BACR_HALHA');

print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo

use Bio::Seq;

my $seq1 = Bio::Seq->new ( -seq => 'ATGAGTAGTAGTAAAGGTA',
                           -id => 'my seq',
                           -desc => 'this is a new Seq');

from a file

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta',
                              -format => 'fasta');

my $seq3 = $seqin->next_seq();

my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta',
                                 -format => 'fasta');

my $seq4 = <$seqin2>;

my $seqin3 = Bio::SeqIO->newFh ( -file => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
my $seq5 = <$seqin3>;

by fetching an entry from a remote database

use Bio::DB::SwissProt;

my $banque = new Bio::DB::SwissProt;
my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database

use Bio::Index::Swissprot;
my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}
print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available). (Solution A.4)

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry. (Solution A.5)

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]
• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI
• Bio::SeqFeature::Generic
• Bio::SeqFeature::FeaturePair
• Bio::SeqFeature::SimilarityPair
• Bio::SeqFeature::Similarity
• Bio::SeqFeature::Gene
• Bio::SeqFeature::Gene::ExonI
• Bio::SeqFeature::Gene::Exon
• Bio::SeqFeature::Gene::Transcript
• Bio::SeqFeature::Gene::TranscriptI
• Bio::SeqFeature::Gene::GeneStructureI
• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (feature key) is a primary tag and codon is a secondary tag:

    CDS    1..1000
           /codon=(seq:"cug",aa:Ser)
           /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)
• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations: you can use fuzzy locations, ranges, or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT
• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.
• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.
• '<' BEFORE
• '>' AFTER
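
To make the coding concrete, here is a minimal sketch in plain perl (no bioperl needed) that classifies a single position string with the names used in the list above. The position_type function is a hypothetical helper written for this illustration; it is not a bioperl method, and it only recognises the simple notations shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): name the kind of position
# described by one EMBL-style position string.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;             # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;             # >177
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;         # 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;   # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;              # 90
    return 'UNKNOWN';
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

In bioperl itself this decision is made by the location classes listed below; the sketch only shows what the notation means.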

Location Classes:

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
  • interface: Bio::Location::CoordinatePolicyI
  • default: Bio::Location::WidestCoordPolicy
  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                   '-threshold' => '3.5',
                                                   '-desc'      => 'test sigcleave protein seq',
                                                   '-type'      => 'AMINO');
    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed the same way as the SeqIO class. Therefore format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap
    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings
    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
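
The column trick used in this example (transpose the rows into columns, filter, transpose back) can be tried without bioperl. The following is a minimal sketch on plain strings; remove_gap_columns is a name chosen for this illustration, and the three short rows are made-up data, not a real alignment.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a list of
# equal-length aligned strings; returns the filtered strings in order.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # 1. transpose the rows into column strings
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        foreach my $chr (split //, $row) {
            $cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2. keep only the columns without any gap
    my @kept = grep { $_ !~ /\Q$gapchar\E/ } @cols;

    # 3. transpose the kept columns back into rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        foreach my $chr (split //, $col) {
            $out[$rownr] .= $chr;
            $rownr++;
        }
    }
    return @out;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-TGT', 'ACTG-');
print join("\n", @filtered), "\n";
```

Only the first and fourth columns of the made-up rows are free of gaps, so each filtered row keeps two characters.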

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
  • print its name and database
  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
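
The hash hint can be tried on its own before touching a real report. Below is a minimal sketch in plain perl, assuming that hits arrive best first (as in a Blast report) and that the database name is the text before the first '|' or ':' of the hit name. The dbname function here is a hypothetical stand-in, not the course's helper, and the hit names are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a dbname() helper: take the text before the
# first '|' or ':' of a hit name as the database name.
sub dbname {
    my ($hit_name) = @_;
    return $hit_name =~ /^([^|:]+)[|:]/ ? $1 : 'unknown';
}

# Return the first (best) hit name seen for each database, in report order.
sub best_hit_by_database {
    my (@hit_names) = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};      # this database is already done
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

# Invented hit names, in the order a report would list them
my @hits = ('sp|P11111|A_HUMAN', 'gb|AAA00001|x',
            'sp|P22222|B_MOUSE', 'pir|S00001|y');
print "$_\n" foreach best_hit_by_database(@hits);
```

With a real report, the loop body would move into the while loop over next_hit(), using the name of each hit.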

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract as a string the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome
• you may first display this as a Clustalw alignment (one sequence per HSP)
• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration
• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?
2. Look at the _parse_predictions method in Bio::Tools::Genscan.
3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:
  • containing an entry for each Genscan predicted CDS
  • put the exons of this CDS as Features in the current entry
• (b) index this EMBL formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS
• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references to an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

  with a variable.

• references to anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference to an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference to an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
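
The three kinds of reference parameters listed in this section (hash, function, output array) can be combined in one small function. The following is a minimal sketch in plain perl; select_keys, is_big and the score values are invented for this illustration and have nothing to do with the bioperl Blast parameters.

```perl
use strict;
use warnings;

# Apply a filter function (code reference) to the values of a hash
# (hash reference) and store the surviving keys, sorted by name, in an
# output array (array reference). Returns the number of keys stored.
sub select_keys {
    my ($params, $filter, $result) = @_;
    foreach my $key (sort keys %$params) {
        push @$result, $key if $filter->($params->{$key});
    }
    return scalar @$result;
}

sub is_big { return $_[0] > 10 }

my %scores = (hitA => 42, hitB => 7, hitC => 13);
my @kept;                                   # filled by select_keys
my $n = select_keys(\%scores, \&is_big, \@kept);
print "$n kept: @kept\n";
```

Passing \%scores and \@kept avoids flattening the hash and the array into one argument list, which is why bioperl constructors take such parameters by reference.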

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a line
    print OUTFILE $sequence;        # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');        # make a tied filehandle
    $sequence = <$in>;                              # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                           # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
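
Here is a minimal sketch, in plain perl, of a function that receives a filehandle as a parameter. write_record is a name invented for this illustration. It is called once with a typeglob reference to STDOUT and once with a lexical filehandle opened on an in-memory string, which perl has supported since version 5.8.

```perl
use strict;
use warnings;

# Write one FASTA-like record to whatever filehandle is passed in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_record(\*STDOUT, 'my_seq', 'ATGAGTAGTAGTAAAGGTA');

# A lexical handle (here opened on an in-memory string) is an ordinary
# scalar and can be passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record($out, 'my_seq', 'ATGAGT');
close($out);
```

After the second call, $buffer holds the record text, which is convenient for testing code that normally writes to a file.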

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

53

Chapter 6 Perl Reminders

bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class

sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n

die $out

bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block

eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )

if( $ )

print Caught exceptionelse

print no exception

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
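The two ideas above (raising with die, catching with eval, and using an exception to mark a method as abstract) can be tried without bioperl. The following is only a sketch: the classes My::SeqI and My::Seq and the helper try_seq are hypothetical names invented for this illustration, and plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and dies unless overridden.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    die "My::SeqI definition of seq - implementing class did not provide this method\n";
}

# Hypothetical concrete class that provides the method.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package main;

# Run the method call inside eval and report what happened.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq() };   # value of the block, or undef if it died
    return $@ ? "caught: $@" : "ok: $value";
}

my $good = try_seq(My::Seq->new(seq => 'ATGC'));
my $bad  = try_seq(My::SeqI->new());
print $good, "\n";
print $bad;
```

Calling seq on the concrete class returns the stored string, while calling it on the interface class raises the exception, which eval turns into an error message in $@.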


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
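As a runnable variant of the first example, the sketch below wraps getopts in a function so that it can be tried on any argument list. The function name parse_args and the sample arguments are assumptions made for this illustration; Getopt::Std itself is a core Perl module.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a command line given as a list; return formats, help flag, leftover args.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;              # getopts() works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    my $inform  = $opts{i} || 'swiss';    # default input format
    my $outform = $opts{o} || 'fasta';    # default output format
    return ($inform, $outform, $opts{h} ? 1 : 0, [@ARGV]);
}

my ($in, $out, $help, $rest) = parse_args('-o', 'gcg', 'seqs.sp');
print "in=$in out=$out help=$help rest=@$rest\n";
```

A colon after an option letter means that the option takes a value; letters without a colon are simple flags. Arguments that are not options stay in @ARGV.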

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component


• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
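The isa and ref calls above work on any Perl class. A minimal sketch, using two hypothetical classes (My::Feature and My::Exon) invented for this illustration:

```perl
use strict;
use warnings;

package My::Feature;                 # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package My::Exon;                    # hypothetical subclass
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new();

my $class_of_exon    = ref($exon);                            # class name of the object
my $exon_is_feature  = $exon->isa('My::Feature')    ? 1 : 0;  # test on an object
my $class_is_feature = My::Exon->isa('My::Feature') ? 1 : 0;  # test on a class name
my $feature_is_exon  = My::Feature->isa('My::Exon') ? 1 : 0;  # the reverse is false

print "class of exon: $class_of_exon\n";
print "exon is-a feature: $exon_is_feature\n";
print "feature is-a exon: $feature_is_exon\n";
```

Note that isa follows the @ISA chain upwards only: a subclass is-a base class, but not the other way round.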

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
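To see when a BEGIN block runs, the following sketch records the order of execution in an array. The variable names and the string 'widest' are placeholders chosen for this illustration, not the bioperl objects.

```perl
use strict;
use warnings;

our @order;           # records the order in which the statements ran
our $coord_policy;    # stands in for the module-level default

push @order, 'runtime';          # ordinary statement, runs at run time

BEGIN {
    # Runs as soon as this block has been compiled,
    # that is, before the push statement above is executed.
    $coord_policy = 'widest';
    push @order, 'begin';
}

print join(' then ', @order), " (policy: $coord_policy)\n";
```

Although the BEGIN block appears after the push statement in the file, its content is executed first, so the default is already set when any other code of the module runs.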

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods.
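The tag mechanism can be tried in a single file. In this sketch the module My::Blast, its variable $Blast and the function version_string are hypothetical stand-ins; the module is defined inline, so the import call in the second BEGIN block plays the role of a use My::Blast qw(:obj) statement.

```perl
use strict;
use warnings;

BEGIN {
    package My::Blast;               # hypothetical module, defined inline
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ();                        # nothing exported by default
    our @EXPORT_OK   = qw($Blast version_string); # importable on demand
    our %EXPORT_TAGS = (obj => [qw($Blast)]);     # tag => set of symbols
    our $Blast = { name => 'static blast object' };
    sub version_string { return 'demo'; }
}

# Equivalent to "use My::Blast qw(:obj);" for a module stored in its own file.
BEGIN { My::Blast->import(':obj'); }

print $Blast->{name}, "\n";          # $Blast was imported through the :obj tag
print "version_string imported: ", (defined &main::version_string ? 'yes' : 'no'), "\n";
```

Only the symbols listed under the requested tag are imported; version_string is exportable but stays in its own package because it was not requested.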

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared global variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
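The effect of both pragmas can be observed by compiling small snippets at run time. This is a sketch: the helper compile_error and the snippet contents are assumptions made for this illustration.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);       # pre-declared package globals

%valid_type = map { $_ => 1 } qw(dna rna protein);   # allowed under strict

# Compile a snippet under strict and return the error message ('' if none).
sub compile_error {
    my ($code) = @_;
    eval "use strict; $code; 1";
    return $@ || '';
}

my $undeclared = compile_error('$typo = 1');     # never declared: rejected
my $declared   = compile_error('my $ok = 1');    # declared with my: accepted

print "undeclared: $undeclared";
print "declared: ", ($declared eq '' ? "no error\n" : $declared);
```

The first snippet fails at compile time because $typo is neither a my variable nor pre-declared, whereas %valid_type can be assigned freely thanks to use vars.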

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:


    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
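The same mechanism can be reproduced with core Perl only. In the sketch below, My::Stream is a hypothetical class that holds a list of strings instead of sequence objects; unlike Bio::SeqIO, its TIEHANDLE simply reuses the object passed to tie.

```perl
use strict;
use warnings;
use Symbol;

package My::Stream;      # hypothetical stand-in for a sequence stream

sub new {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub next_item  { my $self = shift; return shift @{ $self->{in} }; }
sub write_item { my $self = shift; push @{ $self->{out} }, @_; return 1; }

# Return a filehandle tied to this object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym;
    tie *$s, ref($self), $self;
    return $s;
}

sub TIEHANDLE { my ($class, $obj) = @_; return $obj; }   # reuse the object itself
sub READLINE {
    my $self = shift;
    return $self->next_item() unless wantarray;           # scalar context: one item
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->next_item());
    return @list;                                         # list context: all items
}
sub PRINT { my $self = shift; return $self->write_item(@_); }

package main;

my $stream = My::Stream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;          # calls READLINE in scalar context
my @others = <$fh>;          # calls READLINE in list context
print $fh $first;            # calls PRINT
print "first=$first others=@others written=@{ $stream->{out} }\n";
```

Reading from the handle and printing to it never touch a real file: both operations are redirected to the methods of the tied class.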

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1


Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file


    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }

        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";

        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";
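The text processing part of this solution does not depend on bioperl and can be tested on a plain string. In the sketch below, the function name function_of and the comment text are made up for this illustration; only the "-!- FUNCTION:" prefix comes from the solution above.

```perl
use strict;
use warnings;

# Return the text of the first "-!- FUNCTION:" line of a comment block, or ''.
sub function_of {
    my ($text) = @_;
    foreach my $line (split /\n/, $text) {
        # s/// returns true only when the prefix was present and removed
        return $line if $line =~ s/^-!- FUNCTION: //;
    }
    return '';
}

# Made-up comment text in the style of a SwissProt CC block.
my $comment  = "-!- SUBUNIT: Homodimer.\n-!- FUNCTION: Example catalytic activity.\n";
my $function = function_of($comment);
my $missing  = function_of("-!- SUBUNIT: Homodimer.\n");

print "Function: $function\n";
```

The substitution does two jobs at once: it tests whether the line is a FUNCTION line and strips the prefix, so the remaining text can be returned directly.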

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
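The column arithmetic of this solution can be checked on plain strings, without an alignment object. The sketch below is an illustration only: the function ungapped_range and the three aligned rows are invented, and substr replaces the slice method.

```perl
use strict;
use warnings;

# Given aligned rows (strings of equal length), return the 1-based first and
# last columns that remain once the leading and trailing gap blocks are cut.
sub ungapped_range {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;   # gaps at the beginning
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;   # gaps at the end
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC--', 'TTACGTACG-', '-TACG-ACGG');
my ($from, $to) = ungapped_range('-', @rows);
my @trimmed = map { substr($_, $from - 1, $to - $from + 1) } @rows;
print "$from..$to: @trimmed\n";      # prints "3..8: ACGTAC ACGTAC ACG-AC"
```

Only the gap blocks at both ends are removed; the internal gap of the third row is kept, as in the solution.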

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {

            my $rc = $factory->retrieve_blast($rid);
            if (! ref($rc)) {

                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;

            } else {

                # ---- Blast done ----
                $factory->remove_rid($rid);

                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank name: the word before the first '|' in the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
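The "first hit seen per databank" logic relies only on a hash of names already seen, so it can be tried on a plain list. In this sketch the hit names are made up, and the "db|accession|name" layout is an assumption about what the hit names look like.

```perl
use strict;
use warnings;

# Databank prefix of a hit name such as "sp|P00001|AAA_ECOLI"
# (the word before the first '|'), or '' when there is none.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Hit names in report order, best hit first; these values are made up.
my @hits = ('sp|P00001|AAA_ECOLI', 'sp|P00002|BBB_ECOLI',
            'pir|A00003|CCC', 'sp|P00004|DDD_HUMAN', 'pir|B00005|EEE');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;        # a better hit of this databank was already kept
    push @best, $hit;
}
print "$_\n" for @best;
```

Because a Blast report lists hits from best to worst, keeping the first hit of each databank is enough to get the best one.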

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                                  -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
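The gap padding performed by fill_gaps in Solution A.20 can be tried on its own. The following sketch is an equivalent formulation that uses the x repetition operator instead of loops; the sample sequence and coordinates are made up.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;                        # gaps before the sequence
    my $right = $length - $left - length($seq);    # gaps after the sequence
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";      # prints "--ACGT----"
```

Each padded row has the same width as the contig, which is what allows the HSP to be added to the alignment at the right position.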

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref);
        push (@previous_hitarray, @$hitarray_ref);
    }


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                                -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;   # a module must return a true value
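The argument checks of get_Seq (splitting "db:id", rejecting a missing database or an unknown one, looking up the flat file format) can be exercised without golden or bioperl. In this sketch the function parse_entry is a hypothetical helper written for the illustration, and plain die replaces the throw method.

```perl
use strict;
use warnings;

my %AVAIL_LOCALDB = (sprot => 'swiss', sptrnrdb => 'swiss',
                     gb    => 'genbank', embl   => 'embl');

# Split "db:id" and return (db, id, format); die with a message on bad input.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my @ok   = parse_entry('sprot:TAUD_ECOLI');
my $err1 = eval { parse_entry('TAUD_ECOLI'); 1 } ? '' : $@;   # no "db:" prefix
my $err2 = eval { parse_entry('nodb:X'); 1 }     ? '' : $@;   # unknown database

print "@ok\n";
print $err1, $err2;
```

Checking the entry before starting the external program gives a precise error message instead of an empty sequence stream.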


Chapter 2 Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1 SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

• read the sequence from input stream

• write it to output stream

in bioperl:

    #! /local/bin/perl -w

    use strict;

    use Bio::SeqIO;

    my $in  = Bio::SeqIO->newFh(-file   => '<seqs.sp',
                                -format => 'swiss');

    my $out = Bio::SeqIO->newFh(-file   => '>seqs.fasta',
                                -format => 'fasta');

    print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]
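The read/write loop of the pseudocode can also be sketched without bioperl, which makes the stream idea easier to experiment with. In the following illustration, next_record and write_tab are hypothetical helpers (not bioperl functions), the input is a made-up FASTA text held in memory, and the output format is a simple tab-separated line per sequence.

```perl
use strict;
use warnings;

# Read the next FASTA record from a filehandle; return (id, sequence),
# or an empty list at end of input.
sub next_record {
    my ($fh) = @_;
    local $/ = "\n>";                    # read one record at a time
    my $chunk = <$fh>;
    return unless defined $chunk;
    $chunk =~ s/^>//;                    # marker of the very first record
    $chunk =~ s/\n>\z//;                 # separator read with this record
    my ($header, @lines) = split /\n/, $chunk;
    my ($id) = split ' ', $header;       # first word of the header line
    return ($id, join('', @lines));
}

# Write one record as "id<TAB>sequence".
sub write_tab {
    my ($fh, $id, $seq) = @_;
    print {$fh} "$id\t$seq\n";
}

my $input  = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
my $output = '';
open(my $in,  '<', \$input)  or die "cannot open input: $!";    # input stream
open(my $out, '>', \$output) or die "cannot open output: $!";   # output stream

while (my ($id, $seq) = next_record($in)) {   # for all sequences
    write_tab($out, $id, $seq);               # write it to the output stream
}
close($out);
print $output;
```

As in the bioperl version, the main loop knows nothing about the formats: all format knowledge is hidden behind the reading and the writing side of the stream.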


Exercise 2.1 Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2 A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2 Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its Accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";


Exercise 2.3 Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo

    use Bio::Seq;

    my $seq1 = Bio::Seq->new(-seq  => 'ATGAGTAGTAGTAAAGGTA',
                             -id   => 'my seq',
                             -desc => 'this is a new Seq');

from a file

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new(-file   => 'seq.fasta',
                                -format => 'fasta');

    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh(-file   => 'seq.fasta',
                                   -format => 'fasta');

    my $seq4 = <$seqin2>;
    my $seqin3 = Bio::SeqIO->newFh(-file   => 'golden sp:TAUD_ECOLI |',
                                   -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new(-filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');


data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed further in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1 Bio::Seq class structure


Figure 2.2 Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure and, if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl

    # PDB structures entries
    # ($annotation is assumed to hold the annotation collection of the
    # sequence, e.g. $seq->annotation())
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }
    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary solution [codesuseSeq10pl] of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4 Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5 Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6 extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an Embl entry and the Bio::SeqFeature class.

Exercise 2.7 extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6 Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinates types: there are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases); example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range); example: (102.110) is a site within the 102 to 110 range, inclusive

• '<' BEFORE

• '>' AFTER
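
The coding above can be illustrated without bioperl. The following is only a sketch in plain Perl: classify_position is a hypothetical helper written for this illustration (it is not a bioperl method), and it assumes that positions are written as in the feature table examples above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): name the kind of
# feature-table position that a string describes.
sub classify_position {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;            # e.g. <30
    return 'AFTER'   if $pos =~ /^>\d+$/;            # e.g. >177
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;        # e.g. 123^124
    return 'WITHIN'  if $pos =~ /^\(\d+\.\d+\)$/;    # e.g. (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;             # e.g. 90
    return 'UNKNOWN';
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    printf "%-10s %s\n", $pos, classify_position($pos);
}
```

In the Bio::Location::Fuzzy example, the start '<30' would be classified as BEFORE and the end 90 as EXACT.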

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination

• interface: Bio::Location::CoordinatePolicyI

• default: Bio::Location::WidestCoordPolicy

• Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7 Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line by using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8 Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;                 # sequence as string
    $seqobj->subseq(10,40);       # sub-sequence as string
    $seqobj->trunc(10,100);       # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
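
As a reminder of the coordinate convention, sequence positions in these calls are counted from 1 and the end position is included. The sketch below mimics that convention on a plain string; subseq_1based is a hypothetical helper written for this illustration, not the bioperl method itself.

```perl
use strict;
use warnings;

# Plain-string stand-in for a 1-based, inclusive sub-sequence call.
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    # substr() counts from 0 and takes a length, hence the conversion.
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print subseq_1based($dna, 1, 3), "\n";     # the first codon
print subseq_1based($dna, 4, 9), "\n";     # the next two codons
```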

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                   '-threshold' => '3.5',
                                                   '-desc'      => 'test sigcleave protein seq',
                                                   '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1 Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::AlignIO->newFh ( -fh => \*STDIN, -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1 AlignIO Classes diagram

3.2 SimpleAlign

At first, let's see some basic methods of the SimpleAlign class.

Example 3.2 Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2 Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3 Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
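
The same column logic can be tried without an alignment file. The sketch below works on plain strings of equal length instead of a SimpleAlign object; remove_gap_columns and the three sample rows are made up for this illustration.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a list of
# equal-length aligned strings. Returns the new strings in input order.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # rows -> columns
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        foreach my $chr (split //, $row) {
            $cols[$colnr++] .= $chr;
        }
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        foreach my $chr (split //, $col) {
            $out[$rownr++] .= $chr;
        }
    }
    return @out;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @filtered;
```

Working on strings first makes it easier to check the transposition step before plugging the result back into the sequence objects.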

Exercise 3.1 Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes

• Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

• Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

• Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1 StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1 Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2 Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3 Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4 Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html])

• print its name and database

• for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html])

• print some hit statistics

Example 4.2 StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5 Display Blast hits

Display the hit accession and the HSP length.

Solution A.11

Exercise 4.6 Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast (with _READMETHOD set to Blast) actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3 Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7 Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8 Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1 Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9 Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10 Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
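
The hash hint can be tried on plain strings before working on a real report. The sketch below is not the course solution: the dbname function here is a hypothetical stand-in for the course's helper (it assumes the database name is the text before the first '|' or ':'), and the hit names are made up and assumed to be sorted from best to worst.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the course's dbname() helper: take the text
# before the first '|' or ':' of a hit name as the database name.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = $hit_name =~ /^([^|:]+)[|:]/;
    return defined $db ? $db : 'unknown';
}

# Return the first hit name seen for each database, in input order.
sub first_hit_per_database {
    my @hit_names = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};      # this database is already done
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, assumed to be sorted from best to worst.
my @hits = ('sp|HIT1', 'pdb|HIT2', 'sp|HIT3', 'pdb|HIT4');
print "$_\n" for first_hit_per_database(@hits);
```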

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract, as a string, the subject sequence from the HSPs with identity above a given level.

Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignments from the hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15 Locate EST in a genome

Locate an EST in a genome.

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17 Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

• references on anonymous data structures

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See, for instance, Exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
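
The three kinds of references listed above can be exercised together in plain Perl. This is only a sketch: run_filter, long_enough and the sample strings are invented for the illustration and do not belong to bioperl.

```perl
use strict;
use warnings;

# Apply a filter function (code reference) to the sequences listed in a
# parameter hash (hash reference), and store the survivors in an output
# array (array reference). Returns the number of stored items.
sub run_filter {
    my ($params, $filter, $results) = @_;
    foreach my $seq (@{ $params->{-seqs} }) {
        push @$results, $seq if $filter->($seq);
    }
    return scalar @$results;
}

# Example filter: keep strings of at least 5 characters.
sub long_enough { return length($_[0]) >= 5 }

my %runParam = ( -prog => 'blastp',
                 -seqs => [ 'MKT', 'MASTEGLMP', 'MKTAYIAK' ] );   # anonymous array
my @kept;                                                         # output parameter
my $count = run_filter(\%runParam, \&long_enough, \@kept);

print "$count kept: @kept\n";
print ref(\%runParam), " ", ref(\&long_enough), " ", ref(\@kept), "\n";
```

The last line shows what ref() returns for each kind of reference.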

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;              # read one line
    print OUTFILE $sequence;       # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
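
Passing a filehandle to your own function works the same way. In this plain Perl sketch, write_record is a hypothetical function written for the illustration; it accepts either a typeglob reference such as \*STDOUT or a lexical filehandle.

```perl
use strict;
use warnings;

# Write one fasta-like record to whatever filehandle is passed in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
}

# A typeglob reference to a standard stream.
write_record(\*STDOUT, 'seq1', 'ATGAGTAGT');

# A lexical filehandle, here opened on an in-memory string.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGACA');
close($mem);
print $buffer;
```

Lexical filehandles (my $mem) are ordinary scalars, so they can be passed around without the typeglob syntax.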

623 Exceptions

bull When using a bioperl module the following can happen

-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130

STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85

STACK toplevel solutionblast_featurespl22-------------------------------------------

bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught

53

Chapter 6 Perl Reminders

bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class

sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n

die $out

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
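The die/eval mechanism described above can be tried without bioperl. In this hedged sketch, risky_division is a hypothetical stand-in for a bioperl call that may throw; only die, eval and $@ are the real mechanism.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bioperl call that may throw an exception.
sub risky_division {
    my ($num, $den) = @_;
    die "division by zero requested\n" if $den == 0;
    return $num / $den;
}

# eval returns undef when the block dies; the message is left in $@.
my $result = eval { risky_division(10, 0) };
my $caught = $@;    # copy $@ right away, a later eval may reset it

if ($caught) {
    print "Caught exception: $caught";
} else {
    print "no exception, result = $result\n";
}
```

Copying $@ into a lexical immediately after the eval is a common precaution, since any code that runs afterwards may overwrite it.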

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
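The same abstract-method idiom can be sketched in a few lines of plain Perl. The classes ShapeI, Square and Blob are hypothetical and exist only for this illustration; they are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical interface class: area() must be provided by subclasses.
package ShapeI;
sub new  { my ($class, %args) = @_; return bless {%args}, $class; }
sub area {
    my ($self) = @_;
    die ref($self) . " did not implement area()\n";
}

# Concrete subclass that overrides the abstract method.
package Square;
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2; }

# Subclass that forgot to implement it.
package Blob;
our @ISA = ('ShapeI');

package main;

my $square_area = Square->new(side => 3)->area();
my $blob_area   = eval { Blob->new()->area() };   # falls through to ShapeI::area
my $error       = $@;

print "square: $square_area\n";
print "blob:   $error";
```

Calling area() on a Blob reaches the interface version, which dies with a message naming the offending class, just as Bio::PrimarySeqI::seq does for a subclass without seq.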

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
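To see how the option string drives parsing, here is a small self-contained sketch. The simulated command line and the file name seqs.embl are assumptions for illustration; a letter followed by a colon takes a value, a letter alone is a boolean flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line; in a real script @ARGV is filled by the shell.
local @ARGV = ('-i', 'embl', '-h', 'seqs.embl');

my %opts;
getopts('hi:o:', \%opts) or die "bad options\n";

my $inform  = $opts{'i'} || 'swiss';        # -i was given
my $outform = $opts{'o'} || 'fasta';        # -o was not given: default applies
my $help    = exists $opts{'h'} ? 1 : 0;    # -h is a flag without value

# getopts removes the options it recognised and leaves the rest in @ARGV.
print "in=$inform out=$outform help=$help rest=@ARGV\n";
```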

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component:

    • Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

    • Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
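The isa and ref tests above work on any Perl class. A minimal sketch follows, in which the two-class hierarchy Feature/Exon is hypothetical and unrelated to the bioperl classes of similar names.

```perl
use strict;
use warnings;

# Hypothetical two-level hierarchy.
package Feature;
sub new { my $class = shift; return bless {}, $class; }

package Exon;
our @ISA = ('Feature');

package main;

my $exon = Exon->new();

my $class_check  = Exon->isa('Feature')  ? 1 : 0;   # test on the class name
my $object_check = $exon->isa('Feature') ? 1 : 0;   # test on an instance
my $class_name   = ref($exon);                      # class the object was blessed into

print "class of exon: $class_name\n";
```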

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
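A short plain-Perl sketch makes the timing visible: the BEGIN block runs while the file is being compiled, before any ordinary statement, even one that appears earlier in the file. The @order array is a hypothetical name used only to record the sequence of events.

```perl
use strict;
use warnings;

our @order;

# Ordinary statement: executed at run time, after compilation has finished.
push @order, 'runtime statement';

# BEGIN block: executed as soon as it has been compiled.
BEGIN { push @order, 'BEGIN block'; }

print join(' -> ', @order), "\n";
```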

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj (here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods).
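The three Exporter variables can be exercised in one self-contained file. In this sketch the module My::Tools, its revcomp function and the :seq tag are hypothetical; the module is written inline, so its setup sits in a BEGIN block and import is called explicitly instead of through use.

```perl
use strict;
use warnings;

# Hypothetical module, written inline so that the sketch is self-contained.
package My::Tools;

BEGIN {
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ();                      # nothing exported by default
    our @EXPORT_OK   = qw(revcomp);             # available on demand
    our %EXPORT_TAGS = ( seq => [qw(revcomp)] );   # named set of symbols
}

# Reverse-complement a DNA string.
sub revcomp {
    my ($dna) = @_;
    (my $rc = reverse $dna) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

package main;

# For a module stored in its own file this would read: use My::Tools qw(:seq);
BEGIN { My::Tools->import(':seq'); }

my $rc = revcomp('ATGC');
print "$rc\n";
```

Every symbol listed under a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses the import.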

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (lexical variables are declared with a my statement; global variables must be pre-declared or fully qualified), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
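The tie mechanism can be reproduced with a tiny class of your own. In this sketch QueueHandle is a hypothetical class that keeps its items in an array instead of a sequence file; TIEHANDLE, READLINE and PRINT are the standard method names perl looks for on a tied handle.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical class: a tied handle backed by an in-memory queue.
package QueueHandle;

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [@items] }, $class;
}

# Called for <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    my @all = @{ $self->{items} };
    $self->{items} = [];
    return @all;
}

# Called for print $fh ...: append the printed values to the queue.
sub PRINT {
    my $self = shift;
    push @{ $self->{items} }, @_;
    return 1;
}

package main;

my $fh = Symbol::gensym();
tie *$fh, 'QueueHandle', 'seq1', 'seq2';

my $first = <$fh>;        # calls READLINE in scalar context
print $fh 'seq3';         # calls PRINT
my @rest  = <$fh>;        # calls READLINE in list context

print STDOUT "first=$first rest=@rest\n";
```

The wantarray test in READLINE mirrors the Bio::SeqIO version: a scalar read returns the next object, a list read drains the stream.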

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1.prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # return the databank prefix of a hit name, i.e. the word before the first '|'
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done = ();
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::Seq::RichSeq;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #                           + 1 feature for the whole CDS
    #                           + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry ( join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry:
        # subsequence of the sequence submitted to genscan +- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

Chapter 2. Sequences

2.1 The Bio::SeqIO class

2.1.1 Format Converter

Example 2.1. SwissProt -> Fasta

pseudocode:

• Construct a SwissProt formatted sequences input stream

• Construct a fasta formatted sequences output stream

• for all sequences:

  • read the sequence from the input stream

  • write it to the output stream

in bioperl:

    #! /local/bin/perl -w

    use strict;

    use Bio::SeqIO;

    my $in  = Bio::SeqIO->newFh ( -file => '<seqs.sp',    -format => 'swiss' );
    my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta', -format => 'fasta' );

    print $out $_ while <$in>;

data: seqs.sp [data/seqs.sp]

Exercise 2.1. Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2. A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2. Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";

Exercise 2.3. Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo:

    use Bio::Seq;

    my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                               -id   => 'my seq',
                               -desc => 'this is a new Seq');

from a file:

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta', -format => 'fasta');
    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta', -format => 'fasta');
    my $seq4 = <$seqin2>;

    my $seqin3 = Bio::SeqIO->newFh ( -file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database:

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database:

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are shown in more detail in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure

Figure 2.2. Relation between a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

    # PDB structure entries
    # ($annotation is assumed to be the annotation object of the sequence)
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ( $link->database() eq 'PDB' ) {
            push( @structures, $link->primary_id() );
        }
    }

    print "\nPDB Structures: ", join( " ", @structures ), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within the comments, if available (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

  • Bio::SeqFeature::Gene::ExonI

  • Bio::SeqFeature::Gene::Exon

  • Bio::SeqFeature::Gene::Transcript

  • Bio::SeqFeature::Gene::TranscriptI

  • Bio::SeqFeature::Gene::GeneStructureI

  • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS    1..1000
           /codon=(seq:"cug",aa:Ser)
           /codon=(seq:"tga",aa:Trp)

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
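To make the primary tag / secondary tag distinction concrete, here is a minimal sketch in plain Perl. It does not use bioperl: the feature is an ordinary hash, and the subs are hypothetical stand-ins that only borrow the names add_tag_value, has_tag and each_tag_value to suggest how such methods behave. The assumption illustrated is that a qualifier may be repeated, so each tag maps to a list of values.

```perl
use strict;
use warnings;

# A feature reduced to plain data: primary tag, range, and qualifiers.
# Each qualifier (secondary tag) maps to a list, since it may be repeated.
my $feature = { primary_tag => 'CDS', start => 1, end => 1000, tags => {} };

# Append one value to a tag (hypothetical helper, not the bioperl method).
sub add_tag_value {
    my ( $feat, $tag, $value ) = @_;
    push @{ $feat->{tags}{$tag} }, $value;
    return;
}

# True if the tag has at least been added once.
sub has_tag {
    my ( $feat, $tag ) = @_;
    return exists $feat->{tags}{$tag};
}

# All values recorded for a tag, or an empty list.
sub each_tag_value {
    my ( $feat, $tag ) = @_;
    return @{ $feat->{tags}{$tag} || [] };
}

add_tag_value( $feature, 'codon', '(seq:"cug",aa:Ser)' );
add_tag_value( $feature, 'codon', '(seq:"tga",aa:Trp)' );

for my $tag ( sort keys %{ $feature->{tags} } ) {
    print "$feature->{primary_tag} /$tag=$_\n" for each_tag_value( $feature, $tag );
}
```

A new qualifier type is then just a new key, which is the idea behind recording Blast hits or Genscan results as tags.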

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, or sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations: you can use fuzzy locations, ranges, or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy( -start    => '<30',
                                                  -end      => 90,
                                                  -loc_type => '.' );

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the range 102 to 110, inclusive

• '<' BEFORE

• '>' AFTER
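The coding above can be illustrated with a small plain-Perl sketch that classifies a single position string. This is not how Bio::Location::Fuzzy is implemented; classify_position is a hypothetical helper, and the accepted spellings ('<30', '>200', '123^124', '(102.110)', '90') are assumptions based on the examples in the list.

```perl
use strict;
use warnings;

# Classify one position string and return (type, min, max).
# Hypothetical helper for illustration, not a bioperl call.
sub classify_position {
    my ($pos) = @_;
    return ( 'BEFORE',  $1, $1 ) if $pos =~ /^<(\d+)$/;
    return ( 'AFTER',   $1, $1 ) if $pos =~ /^>(\d+)$/;
    return ( 'BETWEEN', $1, $2 ) if $pos =~ /^(\d+)\^(\d+)$/;
    return ( 'WITHIN',  $1, $2 ) if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    return ( 'EXACT',   $1, $1 ) if $pos =~ /^(\d+)$/;
    die "unrecognised position: $pos\n";
}

for my $pos ( '<30', '90', '123^124', '(102.110)', '>200' ) {
    my ( $type, $min, $max ) = classify_position($pos);
    print "$pos\t$type\t$min..$max\n";
}
```

The min and max values are what a coordinate policy would then choose between when a single start or end number is needed.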

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location classes structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
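Sequence coordinates in calls such as subseq(10,40) are 1-based and inclusive, whereas Perl's substr is 0-based and takes a length. The following plain-Perl sketch shows the conversion on an ordinary string; subseq_1based is a hypothetical helper written for this illustration, not a bioperl method.

```perl
use strict;
use warnings;

# Return the sub-sequence from $start to $end, both 1-based and inclusive.
# Hypothetical helper working on a plain string.
sub subseq_1based {
    my ( $seq, $start, $end ) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    return substr( $seq, $start - 1, $end - $start + 1 );
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print subseq_1based( $dna, 4, 9 ), "\n";    # prints AGTAGT
```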

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new( -seq => $seqobj );
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new( -seq => $seqobj );
    $seq_word->count_words($word_length);
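As a rough idea of what such counting involves, here is a plain-Perl sketch on ordinary strings. Both subs are hypothetical helpers, not the bioperl methods; in particular, count_words here counts overlapping words with a sliding window, which is an assumption and may differ from how Bio::Tools::SeqWords defines its words.

```perl
use strict;
use warnings;

# Count each letter of a sequence string; returns a hash reference.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Count overlapping words of a given length (sliding window, step 1).
sub count_words {
    my ( $seq, $len ) = @_;
    my %count;
    for my $i ( 0 .. length($seq) - $len ) {
        $count{ uc substr( $seq, $i, $len ) }++;
    }
    return \%count;
}

my $monomers = count_monomers('ATGAGTAGTAGTAAAGGTA');
print "$_: $monomers->{$_}\n" for sort keys %$monomers;

my $words = count_words( 'ATGAGTAGTAGTAAAGGTA', 3 );
print "$_: $words->{$_}\n" for sort keys %$words;
```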

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat     = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern( -SEQ => $pat, -TYPE => 'Dna' );
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRI' );
    @fragments = $re1->cut_seq($seqobj);

    # creates a new one
    $re2 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRV--GAT^ATC',
                                              -MAKE => 'custom' );
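The 'GAT^ATC' notation marks the cut position inside the recognition site. A minimal plain-Perl sketch of cutting a string at such a site follows; cut_at_site is a hypothetical helper for illustration, works on one strand only, handles only literal sites (no ambiguity codes), and is not the algorithm used by Bio::Tools::RestrictionEnzyme.

```perl
use strict;
use warnings;

# Cut a DNA string at every occurrence of a recognition site.
# $site uses '^' to mark the cut position, e.g. 'G^AATTC'.
sub cut_at_site {
    my ( $seq, $site ) = @_;
    my $offset = index( $site, '^' );
    die "site must contain '^'\n" if $offset < 0;
    ( my $pattern = $site ) =~ s/\^//;    # recognition sequence without the mark

    my @fragments;
    my $last = 0;                         # start of the current fragment
    my $pos  = 0;                         # where to resume searching
    while ( ( my $hit = index( $seq, $pattern, $pos ) ) >= 0 ) {
        push @fragments, substr( $seq, $last, $hit + $offset - $last );
        $last = $hit + $offset;
        $pos  = $hit + 1;
    }
    push @fragments, substr( $seq, $last );
    return @fragments;
}

my @frags = cut_at_site( 'AAGAATTCTTGAATTCGG', 'G^AATTC' );
print join( ' ', @frags ), "\n";    # prints AAG AATTCTTG AATTCGG
```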

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave( '-file'      => 'sigtest.aa',    # not a Bio::Seq object
                                                   '-threshold' => '3.5',
                                                   '-desc'      => 'test sigcleave protein seq',
                                                   '-type'      => 'AMINO' );

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ( $aln->is_flush() ) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. To filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list
    # containing all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split( "", $seq->seq() ) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar     = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq( shift @seq_strs );
    }

    print $out $aln;
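The column logic of Example 3.3 can be tried without bioperl on plain strings. The sketch below follows the same three steps (transpose rows into columns, drop columns containing the gap character, transpose back); drop_gap_columns is a hypothetical helper, and it assumes that all rows have the same length.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character in at least one row.
sub drop_gap_columns {
    my ( $gapchar, @rows ) = @_;

    # rows -> columns: one string per alignment column
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[ $i++ ] .= $_ for split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index( $_, $gapchar ) < 0 } @cols;

    # columns -> rows: rebuild one string per sequence
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[ $i++ ] .= $_ for split //, $col;
    }
    return @out;
}

print "$_\n" for drop_gap_columns( '-', 'AC-GT', 'A-TGT', 'ACTG-' );
```

With the three rows shown, only the first and fourth columns are gap-free, so each row is reduced to AG.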

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2). Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]). Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out). Solution A.9

Exercise 4.4. Running a remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI. Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length. Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.) Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input. Solution A.13

4.1.2.3 Blast classes diagram

Figure 4.1. Blast classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]). Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname($hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

      $done{$database} = 1;
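The "already done" hash idiom can be seen in isolation in the plain-Perl sketch below. It does not parse a Blast report: the hit records are made-up placeholders (database, hit name, E-value), assumed to be sorted best-first as hits are in a report, and best_hit_per_database is a hypothetical helper.

```perl
use strict;
use warnings;

# Placeholder hit records: [ database, hit name, E-value ], sorted best-first.
my @hits = (
    [ 'sp',  'sp_hit_1',  1e-50 ],
    [ 'pdb', 'pdb_hit_1', 1e-45 ],
    [ 'sp',  'sp_hit_2',  1e-20 ],
    [ 'pdb', 'pdb_hit_2', 1e-10 ],
);

# Keep the first hit seen for each database.
sub best_hit_per_database {
    my (@sorted_hits) = @_;
    my ( %done, @best );
    for my $hit (@sorted_hits) {
        my ($database) = @$hit;
        next if $done{$database};    # a better hit was already kept
        $done{$database} = 1;
        push @best, $hit;
    }
    return @best;
}

printf "%s\t%s\t%g\n", @$_ for best_hit_per_database(@hits);
```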

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature of the query sequence. Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'  => 'blastp',
                                                         'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query    = <$query_in>;

    $factory = Bio::Tools::Run::StandAloneBlast->new( 'database' => 'swissprot' );
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while ( my $hit = $result->nextSbjct() ) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ( $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new( -fh => \*FH );

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query1    = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' );
    my $query2    = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );
    $report  = $factory->bl2seq( $query1, $query2 );

    while ( my $hsp = $report->next_feature ) {
        my $aln      = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new( -seq   => $hsp->qs,
                                               -id    => $query1->display_id,
                                               -start => 1,
                                               -end   => $query1->length );
        my $sbjctSeq = Bio::LocatableSeq->new( -seq   => $hsp->ss,
                                               -id    => $report->sbjctName,
                                               -start => 1,
                                               -end   => $query2->length );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );
    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ( $gene->exons() ) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ( $gene->introns() ) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ( $gene->utrs() ) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

  • containing an entry for each Genscan predicted CDS

  • put this CDS's exons as features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

      FT   CDS             join(complement(100..171),complement(284..358),
      FT                   complement(457..492),complement(626..673),
      FT                   complement(1460..1511),complement(2473..2513),
      FT                   complement(2638..2716),complement(2813..2969),
      FT                   complement(3046..3178),complement(3263..3390),
      FT                   complement(3476..3635))
      FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
      FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
      FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
      FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
      FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
      FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

      $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                          -parse     => 1,
                                          -filt_func => \&my_filter );

• for an output parameter:

      %blastParam = ( -run        => \%runParam,
                      -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

      my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

• references to an anonymous array:

      %runParam = ( -method   => 'remote',
                    -prog     => 'blastp',
                    -database => 'swissprot',
                    -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to the following, with a variable:

      %runParam = ( -method   => 'remote',
                    -prog     => 'blastp',
                    -database => 'swissprot',
                    -seqs     => \@seqs );      # Bio::Seq.pm objects

• references to anonymous data structures:

      # (reminder) plain hash
      %a = ( a => 1, b => 2 );
      print "hash: ", ref( \%a ), " $a{a} $a{b}\n";

      # reference to an anonymous hash
      $ra = { a => 1, b => 2 };
      print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

      # reference to an anonymous array
      $rl = [ 1, 2 ];
      print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

  See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
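The first case (passing a hash and a function to another function) can be tried on its own with the plain-Perl sketch below. The parameter names -min_length and -hits, and the subs select_hits and long_enough, are hypothetical and chosen only for this illustration; they are not bioperl parameters.

```perl
use strict;
use warnings;

# Receives a hash reference of options and a code reference used as a filter.
sub select_hits {
    my ( $params, $filter ) = @_;
    return grep { $filter->( $_, $params ) } @{ $params->{-hits} };
}

# Filter function: keep lengths at or above the configured minimum.
sub long_enough {
    my ( $length, $params ) = @_;
    return $length >= $params->{-min_length};
}

my %param = ( -min_length => 50, -hits => [ 30, 75, 120 ] );

my @kept = select_hits( \%param, \&long_enough );
print "kept: @kept\n";                                      # prints kept: 75 120
print ref( \%param ), " ", ref( \&long_enough ), "\n";      # prints HASH CODE
```

An anonymous sub works the same way: select_hits( \%param, sub { $_[0] < 100 } ) passes the filter inline.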

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

      open( INFILE, $infile ) || die "cannot open $infile: $!";
      open( OUTFILE, "> $outfile" ) || die "cannot open $outfile: $!";
      $line = <INFILE>;            # read a line
      print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

      $in       = Bio::SeqIO->newFh( -file => 'f' );       # make a tied filehandle
      $sequence = <$in>;                                   # read a sequence object
      $out      = Bio::SeqIO->newFh( -file => '> f' );
      print $out $sequence;                                # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

      my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

  Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
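A tied filehandle is what lets <$in> return an object instead of a line. The sketch below shows the mechanism with core Perl only: RecordStream and new_fh are hypothetical names invented for this illustration, and the "records" are plain strings rather than sequence objects. It should be read as an outline of the idea, not as the code of Bio::SeqIO.

```perl
use strict;
use warnings;

package RecordStream;    # hypothetical class, not part of bioperl

# Called by tie(): keep the records to read and a log of what was printed.
sub TIEHANDLE {
    my ( $class, @records ) = @_;
    return bless { queue => [@records], written => [] }, $class;
}

# Called for <$fh>: hand out the next record, or undef when exhausted.
sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{queue} };
}

# Called for print $fh ...: store the items instead of writing text.
sub PRINT {
    my ( $self, @items ) = @_;
    push @{ $self->{written} }, @items;
    return 1;
}

package main;
use Symbol qw(gensym);

# Build an anonymous glob, tie it, and return it, in the spirit of newFh.
sub new_fh {
    my (@records) = @_;
    my $fh = gensym();
    tie *$fh, 'RecordStream', @records;
    return $fh;
}

my $in = new_fh( 'record 1', 'record 2' );
while ( my $record = <$in> ) {
    print "read: $record\n";
}
```

Because the handle is an ordinary scalar holding a glob reference, it can be stored in a variable and passed to functions, unlike a bareword filehandle.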

623 Exceptions

bull When using a bioperl module the following can happen

-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130

STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85

STACK toplevel solutionblast_featurespl22-------------------------------------------

bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught


• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
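The eval and $@ pattern can be exercised without a Blast installation. This is a minimal sketch in plain Perl: set_sequence is a hypothetical stand-in for a bioperl call that validates its argument, and the message text only imitates the bioperl format.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bioperl call that validates its argument.
sub set_sequence {
    my ($seq) = @_;
    die "MSG: [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z*-]+$/;
    return uc $seq;
}

# Runs one attempt inside eval and reports what happened.
sub try_sequence {
    my ($seq) = @_;
    my $result = eval { set_sequence($seq) };   # block form of eval, ends with ;
    if ($@) {
        return "Caught exception: $@";
    }
    return "no exception: $result";
}

print try_sequence('atgc'), "\n";
print try_sequence('==');
```

The program keeps running after the second call, because the die raised inside set_sequence is trapped by eval and only leaves its message in $@.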

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
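The abstract method idiom can be reproduced in a few lines. The sketch below uses three hypothetical classes (SeqInterface, GoodSeq, BadSeq) and a plain die instead of bioperl's throw; it only illustrates the mechanism.

```perl
use strict;
use warnings;

package SeqInterface;            # hypothetical interface class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    # reached only when the subclass did not override seq
    die ref($self) . ": implementing class did not provide seq\n";
}

package GoodSeq;                 # implements the abstract method
our @ISA = ('SeqInterface');
sub seq { my ($self) = @_; return $self->{seq}; }

package BadSeq;                  # forgets to implement it
our @ISA = ('SeqInterface');

package main;

my $ok  = GoodSeq->new(seq => 'ATGC')->seq();
my $err = '';
eval { BadSeq->new(seq => 'ATGC')->seq(); };
$err = $@ if $@;

print "GoodSeq: $ok\n";
print "BadSeq:  $err";
```

The error only appears when the missing method is called, not when the class is loaded, which is the usual trade-off of this idiom in Perl.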


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
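The option handling above can be tested in isolation. In this sketch, parse_formats is a hypothetical wrapper that feeds an argument list to getopts through a localized @ARGV; the option letters and defaults follow the first example.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parses an argument list; returns (input format, output format, remaining args).
sub parse_formats {
    local @ARGV = @_;                 # getopts() works on @ARGV
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    my $inform  = lc($opts{i} || 'swiss');   # default applied before lc()
    my $outform = lc($opts{o} || 'fasta');
    return ($inform, $outform, [@ARGV]);
}

my ($in, $out, $rest) = parse_formats('-i', 'EMBL', 'seqs.embl');
print "in=$in out=$out files=@$rest\n";
```

A letter followed by a colon in the getopts string takes a value; a letter without a colon, such as h here, is a simple flag.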

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component


• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
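Both tests can be tried on toy classes. The classes Feature and Exon below are hypothetical and stand in for the bioperl hierarchy; isa and ref are the standard Perl calls.

```perl
use strict;
use warnings;

package Feature;                 # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package Exon;                    # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new();

my $class      = ref($exon);                        # class of the object
my $is_feature = $exon->isa('Feature')   ? 1 : 0;   # test on an object
my $class_test = Exon->isa('Feature')    ? 1 : 0;   # test on a class name
my $is_other   = $exon->isa('Alignment') ? 1 : 0;   # unrelated class

print "class of exon: $class\n";
print "is a Feature: $is_feature, is an Alignment: $is_other\n";
```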

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
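The timing of a BEGIN block is easy to observe. This sketch is plain Perl with a hypothetical package variable @order; it records the moment at which each statement runs.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';

# BEGIN is executed as soon as the compiler has parsed it,
# that is, before the push statement above is run.
BEGIN { push @order, 'BEGIN block'; }

print join(' -> ', @order), "\n";
```

Although the BEGIN block comes last in the file, its entry is recorded first, which is why a module can rely on values set there being ready before any of its methods is called.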

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
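The three Exporter variables can be seen at work on a small module. SeqUtil, revcomp and gc_count below are hypothetical names; the module is defined inline so that the sketch is self-contained, whereas a real module would live in its own file and be loaded with use.

```perl
use strict;
use warnings;

BEGIN {
    package SeqUtil;                       # hypothetical module, defined inline
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcomp');        # exported by default
    our @EXPORT_OK   = ('gc_count');       # imported on demand
    our %EXPORT_TAGS = (all => ['revcomp', 'gc_count']);

    # reverse complement of a DNA string
    sub revcomp {
        my ($dna) = @_;
        (my $rc = reverse $dna) =~ tr/ACGTacgt/TGCAtgca/;
        return $rc;
    }

    # number of G and C characters
    sub gc_count { my ($dna) = @_; return ($dna =~ tr/GCgc//); }
}

# For a module stored in its own file this would read: use SeqUtil qw(:all);
BEGIN { SeqUtil->import(':all'); }

print revcomp('ATGC'), " ", gc_count('ATGC'), "\n";
```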

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:


    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
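A tied filehandle can be built in plain Perl to see which method each statement triggers. RecordStream below is a hypothetical class that serves strings from an array instead of sequence objects; it is a sketch of the mechanism, not the Bio::SeqIO implementation.

```perl
use strict;
use warnings;
use Symbol ();

package RecordStream;            # hypothetical class, stands in for Bio::SeqIO

sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}

# Called by <$fh>: one record in scalar context, all remaining ones in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @all = @{ $self->{in} };
    $self->{in} = [];
    return @all;
}

# Called by print $fh ...: stores what was printed.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'RecordStream', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;               # READLINE, scalar context
my @rest  = <$fh>;               # READLINE, list context
print $fh $first;                # PRINT
```

The wantarray test in READLINE is what lets the same handle serve both a single read and a read of everything that is left, as in the Bio::SeqIO code above.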


Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1


Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file


    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name:
        # replace the extension by the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh ( -file => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh ( -file => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;  # false
        }
        return 1;  # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default: 'swiss')\n";
        print "  -o outformat -- outfile format (default: 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exercise
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n//g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
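The gap computation in this solution does not depend on bioperl and can be checked on plain strings. In the sketch below, gap_margins is a hypothetical helper and the three rows are invented; substr plays the role of the slice method.

```perl
use strict;
use warnings;

# Returns (number of leading columns, number of trailing columns) to cut
# so that no row starts or ends with a gap character.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

# invented alignment rows, all of the same width
my @rows = ('--ATGCA-', 'GGATGCAT', '-GATG---');
my ($start, $end) = gap_margins('-', @rows);

# keep only the columns between the two margins
my @trimmed = map { substr($_, $start, length($_) - $start - $end) } @rows;
print "$_\n" for @trimmed;
```

The margins are the maxima over all rows, because a column has to be dropped as soon as one sequence has a leading or trailing gap there.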

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog' => 'blastp',
        '-data' => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
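The "first hit per databank" logic relies only on a hash of databanks already seen, so it can be tried on a list of names. The hit names below are invented, and the 'db|accession|name' layout is an assumption about how the report names its hits.

```perl
use strict;
use warnings;

# Extracts the databank prefix from a hit name such as 'sp|P00001|A_ECOLI'.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Invented hit names, in report order (best hit first).
my @hits = ('sp|P00001|A_ECOLI', 'sp|P00002|B_ECOLI',
            'pir|S00001|S00001', 'sp|P00003|C_ECOLI', 'pir|S00002|S00002');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;        # a better hit from this databank was already kept
    push @best, $hit;
}

print "$_\n" for @best;
```

Because Blast reports hits in decreasing order of significance, keeping the first hit seen for each databank is enough to obtain the best one.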

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
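The padding done by fill_gaps can be checked on its own. The version below is a sketch that produces the same string with the repetition operator x instead of two loops; the sequence and coordinates are invented.

```perl
use strict;
use warnings;

# Pads $seq with gap characters so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $right = $length - ($start - 1) - length($seq);
    $right = 0 if $right < 0;     # sequence reaches or exceeds the last column
    return ('-' x ($start - 1)) . $seq . ('-' x $right);
}

my $row = fill_gaps('ATGC', 3, 10);
print "$row\n";
```

Every padded row then has the width of the contig, which is what allows it to be added to the alignment as one more sequence.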

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) { print $out $seq; }
        else { print STDERR "no entry with id $id in this banque\n"; }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
        else { $seqstr = $seq->subseq($start, $end); }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1], -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;
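The module detects a failed golden call through $? >> 8, the exit code of the command behind the pipe. This can be observed with any command; in the sketch below, read_from_command is a hypothetical helper, and the Perl interpreter itself ($^X) is used as the external program so that nothing else needs to be installed. It assumes a Unix-like shell.

```perl
use strict;
use warnings;

# Runs a command through a piped open; returns (reference to output lines, exit code).
sub read_from_command {
    my ($command) = @_;
    open(my $pipe, "$command |") or die "cannot run $command: $!";
    my @lines = <$pipe>;
    close($pipe);                 # close() waits for the child and sets $?
    return (\@lines, $? >> 8);    # high byte of $? is the exit code
}

my ($out_ok,  $status_ok)  = read_from_command(qq{$^X -e "print 42"});
my ($out_bad, $status_bad) = read_from_command(qq{$^X -e "exit 3"});

print "first command:  status $status_ok, output @$out_ok\n";
print "second command: status $status_bad\n";
```

The exit code is only meaningful once the pipe has been closed, which is why the module destroys the Bio::SeqIO object before testing $?.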

Page 11: Bio Perl

Chapter 2 Sequences

Exercise 2.1 Bio::SeqIO

Find all available formats via the Bio::SeqIO [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqIO.html] class (Solution A.1).

Exercise 2.2 A universal converter

Modify Example 2.1 to program a universal converter (Solution A.2).

2.2 Sequence classes

2.2.1 Introduction

Example 2.2 Loading a sequence from a remote server

pseudocode:

• Create a sequence object for the BACR_HALHA SwissProt entry

• Print its accession number and description

in bioperl:

    use Bio::DB::SwissProt;

    my $database = new Bio::DB::SwissProt;
    my $seq = $database->get_Seq_by_id('BACR_HALHA');

    print "Seq: ", $seq->accession_number(), " -- ", $seq->desc(), "\n\n";


Exercise 2.3 Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo

    use Bio::Seq;

    my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                               -id   => 'my seq',
                               -desc => 'this is a new Seq');

from a file

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta',
                                  -format => 'fasta');

    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta',
                                     -format => 'fasta');

    my $seq4 = <$seqin2>;
    my $seqin3 = Bio::SeqIO->newFh ( -file => 'golden sp:TAUD_ECOLI |',
                                     -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');


data: small_swiss.inx [data/small_swiss.inx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are detailed in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1 Bio::Seq class structure

Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

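in bioperl

    # PDB structures entries
    # ($annotation is assumed to be the annotation collection of the
    #  sequence object, e.g. my $annotation = $seq->annotation();)
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }
    print "\nPDB Structures: ", join (" ", @structures), "\n";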

Exercise 2.4. More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]
• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes

• Bio::SeqFeatureI
• Bio::SeqFeature::Generic
• Bio::SeqFeature::FeaturePair
• Bio::SeqFeature::SimilarityPair
• Bio::SeqFeature::Similarity
• Bio::SeqFeature::Gene
  • Bio::SeqFeature::Gene::ExonI
  • Bio::SeqFeature::Gene::Exon
  • Bio::SeqFeature::Gene::Transcript
  • Bio::SeqFeature::Gene::TranscriptI
  • Bio::SeqFeature::Gene::GeneStructureI
  • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code, for instance with gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also makes it possible to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)
• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the start or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). The following example is taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location that begins before 30 and ends at 90. Coding:

• '..' EXACT
• '^' BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.
• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.
• '<' BEFORE
• '>' AFTER
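
The coding above can be illustrated without bioperl. The following is only a sketch in plain Perl: position_type is a hypothetical helper (it is not a bioperl method) that classifies a single position written in the feature-table notation, assuming the symbols listed above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): classify one position
# written in the EMBL/GenBank feature-table notation.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;            # e.g. <30
    return 'AFTER'   if $pos =~ /^>\d+$/;            # e.g. >200
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;        # e.g. 123^124
    return 'WITHIN'  if $pos =~ /^\(\d+\.\d+\)$/;    # e.g. (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;             # e.g. 90
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

In bioperl itself this decoding is done by the location classes listed below; the sketch only shows what each symbol means.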

Location Classes

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
  • interface: Bio::Location::CoordinatePolicyI
  • default: Bio::Location::WidestCoordPolicy
  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;              # sequence as string
    $seqobj->subseq(10,40);    # sub-sequence as string
    $seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    # creates a new one
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave(
        '-file'      => 'sigtest.aa',   # not a Bio::Seq object
        '-threshold' => '3.5',
        '-desc'      => 'test sigcleave protein seq',
        '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed like the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
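
The column handling in Example 3.3 can be tried without bioperl on plain strings. The following is a minimal sketch, assuming the aligned sequences are equal-length strings; strip_gap_columns is a hypothetical helper written for this illustration, not a SimpleAlign method.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every column that contains the gap
# character from a list of equal-length aligned strings.
sub strip_gap_columns {
    my ($gapchar, @rows) = @_;

    # transpose rows into columns (one string per column)
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ for split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # transpose back into rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ for split //, $col;
    }
    return @out;
}

my @clean = strip_gap_columns('-', 'AC-GT', 'T-CGA', 'GCCGC');
print "$_\n" for @clean;
```

The two transpositions correspond to steps 1 and 2a of the example; only the gap test in the middle would change for another column property.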

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
  • print its name and database
  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report (hint: use the ref() perl function)?

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report produced by the Bio::Tools::Run::StandAloneBlast factory above actually belongs to the Bio::SearchIO class, so you can create such an object directly from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
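
The "already done" hash idiom can be tried on toy data before applying it to a real report. This is only a sketch: the hit names are made up, they are assumed to be sorted best-first as in a Blast report, and the dbname function below is a hypothetical stand-in for the course's helper (it simply takes the text before the first '|').

```perl
use strict;
use warnings;

# Made-up hit names, assumed to be sorted best-first.
my @hit_names = ('sp|X00001|PROT1_TEST', 'pir|Y00002|PROT2',
                 'sp|X00003|PROT3_TEST', 'gb|Z00004.1|');

# Hypothetical stand-in for dbname(): text before the first '|'.
sub dbname {
    my ($name) = @_;
    my ($db) = split /\|/, $name;
    return $db;
}

my %done;    # databases for which a best hit was already recorded
my @best;
for my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};     # a better hit was already seen
    $done{$database} = 1;
    push @best, $name;
}

print "$_\n" for @best;
```

With a real report, the loop body would receive hit objects instead of strings and call dbname on the hit name.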

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print each alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome
• You may first display this as a Clustalw alignment (one sequence per HSP)
• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite object from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite object from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;

    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration
• the hits that have disappeared
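
Comparing the hit lists of two successive iterations comes down to a set difference, which a hash makes easy in Perl. The sketch below works on made-up hit names and leaves out the report parsing; only_in_first is a hypothetical helper, not a bioperl method.

```perl
use strict;
use warnings;

# Hypothetical helper: names present in @$first but absent from @$second.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$second;
    return grep { !$seen{$_} } @$first;
}

# Made-up hit names for two successive iterations.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_B HIT_C HIT_D);

print "new: ",         join(' ', only_in_first(\@round2, \@round1)), "\n";
print "disappeared: ", join(' ', only_in_first(\@round1, \@round2)), "\n";
```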

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length
                       );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:
  • containing an entry for each Genscan predicted CDS
  • put the exons of this CDS as Features in the current entry
• (b) index this EMBL formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS
• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain, named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
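
The three uses of references listed above (a hash of parameters, a function passed as a parameter, an array used as an output parameter) can be combined in one small runnable sketch. Everything here is illustrative: run_filter and my_filter are hypothetical functions, and the parameter names only mimic the bioperl style.

```perl
use strict;
use warnings;

# Hypothetical function: keeps the items accepted by the callback in
# $param->{-filt_func} and appends them to the caller's array,
# which is passed by reference in $param->{-save_array}.
sub run_filter {
    my ($param, @items) = @_;
    my $filter = $param->{-filt_func};
    push @{ $param->{-save_array} }, grep { $filter->($_) } @items;
    return scalar @{ $param->{-save_array} };
}

# Hypothetical filter: accept lengths of at least 100.
sub my_filter { my ($len) = @_; return $len >= 100 }

my @kept;                                      # output parameter
my %param = ( -filt_func  => \&my_filter,      # reference to a function
              -save_array => \@kept );         # reference to an array
my $count = run_filter(\%param, 50, 120, 300); # reference to a hash

print "kept $count items: @kept\n";
print ref(\%param), " ", ref($param{-filt_func}), " ", ref(\@kept), "\n";
```

The last line shows what ref() returns for each kind of reference.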

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
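
Passing a filehandle to your own function works the same way as passing it to newFh. The sketch below uses plain Perl only: write_fasta is a hypothetical helper, and the in-memory filehandle is just a convenient way to show that the same function accepts a typeglob reference and a lexical filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA record to the filehandle it receives.
sub write_fasta {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword filehandle is passed as a reference to its typeglob.
write_fasta(\*STDOUT, 'my_seq', 'ATGAGTAGTAGTAAAGGTA');

# A lexical filehandle (opened here on an in-memory string) is passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_fasta($mem, 'my_seq', 'ATGAGTAGTAGTAAAGGTA');
close($mem);
```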

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
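The three points above (raising with a throw method, catching with eval, and abstract methods) can be put together in a small sketch that runs without bioperl. The classes My::SeqI, My::GoodSeq and My::LazySeq are hypothetical names invented for the illustration; the real Bio::Root::RootI throw also prints a stack trace, which is left out here.

```perl
use strict;
use warnings;

{
    package My::SeqI;                   # hypothetical interface class

    # throw() follows the convention shown above: format a message, then die.
    sub throw {
        my ($self, $string) = @_;
        die "-------------------- EXCEPTION --------------------\n"
          . "MSG: $string\n"
          . "---------------------------------------------------\n";
    }

    # "Abstract" method: it is only reached when a subclass did not override it.
    sub seq {
        my ($self) = @_;
        $self->throw(ref($self) . " did not provide a seq method");
    }

    package My::GoodSeq;                # implements the interface
    use parent -norequire, 'My::SeqI';
    sub new { my ($class, $s) = @_; return bless { seq => $s }, $class; }
    sub seq { return $_[0]->{seq}; }

    package My::LazySeq;                # forgets to implement seq()
    use parent -norequire, 'My::SeqI';
    sub new { return bless {}, shift; }
}

my @report;
foreach my $obj (My::GoodSeq->new('ATGC'), My::LazySeq->new) {
    my $seq = eval { $obj->seq };       # the block may throw
    if ($@) {
        push @report, ref($obj) . ": caught exception";
    } else {
        push @report, ref($obj) . ": $seq";
    }
}
print "$_\n" for @report;
```

The first object returns its sequence; the second one falls through to the inherited seq method, which throws, and the exception is caught by the eval block.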


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
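To see what getopts leaves behind, the first example can be run against a simulated command line. This is a sketch only: the argument values are made up, and @ARGV is set by hand instead of coming from the shell.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line: script.pl -i embl -O out.fasta extra.txt
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.txt');

my %opts = ();
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i format] [-o format] [-I file] [-O file]\n";

# Options that were not given fall back to their defaults.
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "input:  $infile ($inform)\n";
print "output: $outfile ($outform)\n";
print "left in \@ARGV: @ARGV\n";     # getopts removes the options it parsed
```

A colon after a letter in the option string means that the option takes a value; arguments that are not options stay in @ARGV.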

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

• As a library component


• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
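The same tests can be tried on a toy class hierarchy. The sketch below assumes two hypothetical classes, My::Feature and My::Exon, that stand in for the bioperl classes; only @ISA, isa and ref are the real mechanisms.

```perl
use strict;
use warnings;

{
    package My::Feature;                 # hypothetical base class
    sub new { my $class = shift; return bless {@_}, $class; }

    package My::Exon;                    # hypothetical subclass
    our @ISA = qw(My::Feature);          # inheritance is declared through @ISA
}

# new() is inherited from My::Feature, but the object is blessed into My::Exon.
my $exon = My::Exon->new(start => 10, end => 42);

print "class of exon: ", ref($exon), "\n";
print "the object is a My::Feature\n" if $exon->isa('My::Feature');
print "the class is a My::Feature\n"  if My::Exon->isa('My::Feature');
print "the object is not a My::Other\n" unless $exon->isa('My::Other');
```

isa works on both a class name and an object, and follows the whole @ISA chain; ref only reports the class the object was blessed into.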

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
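The effect of a BEGIN block on execution order can be observed with a few lines of plain perl. In this sketch, @events and $default_policy are hypothetical variables; the string 'widest' merely stands in for a policy object.

```perl
use strict;
use warnings;

our @events;            # package variable, visible at compile time and at run time
our $default_policy;    # hypothetical stand-in for a default object

push @events, 'first run-time statement';

BEGIN {
    # Executed as soon as the compiler has parsed this block,
    # before any ordinary statement of the file is run.
    $default_policy = 'widest';
    push @events, 'BEGIN block';
}

push @events, 'second run-time statement';

print join(' -> ', @events), "\n";
print "default policy: $default_policy\n";
```

Although the BEGIN block is written after the first push, it is recorded first: the default is already in place when the ordinary statements start running.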

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl)

• @EXPORT = qw(): symbols exported by default

• @EXPORT_OK = qw(): symbols imported on demand

• %EXPORT_TAGS = (tag => []): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods.
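The three Exporter variables can be exercised in a single file. This is a sketch under stated assumptions: My::SeqUtils and its functions are hypothetical, and because the package is defined inline, import is called by hand where a module stored in its own file would be loaded with use.

```perl
use strict;
use warnings;

{
    package My::SeqUtils;            # hypothetical module, inlined for the sketch
    use Exporter;
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(revcomp);                          # exported by default
    our @EXPORT_OK   = qw(gc_count at_count);                # imported on demand
    our %EXPORT_TAGS = (count => [qw(gc_count at_count)]);   # a named set of symbols

    sub revcomp  { my $s = scalar reverse $_[0]; $s =~ tr/ACGT/TGCA/; return $s; }
    sub gc_count { my $s = $_[0]; return ($s =~ tr/GC//); }
    sub at_count { my $s = $_[0]; return ($s =~ tr/AT//); }
}

# For a module in its own file this would be written
#     use My::SeqUtils qw(:DEFAULT :count);
# With an inline package, the same import() call is made explicitly.
My::SeqUtils->import(qw(:DEFAULT :count));

my $rc = revcomp('ATGC');        # available through @EXPORT
my $gc = gc_count('ATGCGC');     # available through the :count tag
my $at = at_count('ATGCGC');

print "revcomp: $rc, GC: $gc, AT: $at\n";
```

Every symbol listed under a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses the import.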

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
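Both pragmas can be seen at work in a short script. The values stored in %valid_type and the variable names $type and $name are assumptions chosen for the illustration.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);      # pre-declared package globals

%valid_type = map { $_ => 1 } qw(dna rna protein);    # illustrative values
my $type = 'dna';                   # a lexical variable, declared with my

print "$type is a valid type\n" if $valid_type{$type};

# A symbolic reference (using a string as a variable name) is refused
# at run time under strict refs; eval catches the resulting exception.
my $name = 'type';
my $ok = eval { no warnings; my $value = ${$name}; 1 };
print "symbolic reference rejected: ", ($ok ? 'no' : 'yes'), "\n";
```

Without use vars (or an our declaration), the assignment to %valid_type would not compile under use strict.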

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;             # make a tied filehandle
    $sequence = <$fh>;          # read a sequence object (calls READLINE)
    print $fh $sequence;        # write a sequence object (calls PRINT)
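A tied filehandle can be built in a few lines without bioperl. The sketch below assumes a hypothetical class My::RecordStream that serves strings instead of sequence objects; TIEHANDLE, READLINE and PRINT are the standard perl tie hooks.

```perl
use strict;
use warnings;
use Symbol;

{
    package My::RecordStream;        # hypothetical class, not part of bioperl

    # Keeps a queue of records to read and a list of records written.
    sub new {
        my ($class, @records) = @_;
        return bless { in => [@records], out => [] }, $class;
    }

    # Returns a fresh glob tied to the class of the object.
    sub fh {
        my $self  = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie *$s, $class, $self;
        return $s;
    }

    # Called by tie: the existing object becomes the tied object.
    sub TIEHANDLE { my ($class, $obj) = @_; return $obj; }

    # Called by <$fh>: one record in scalar context, all of them in list context.
    sub READLINE {
        my $self = shift;
        return shift @{ $self->{in} } unless wantarray;
        my @all = @{ $self->{in} };
        @{ $self->{in} } = ();
        return @all;
    }

    # Called by print $fh ...: stores what was printed.
    sub PRINT {
        my $self = shift;
        push @{ $self->{out} }, @_;
        return 1;
    }
}

my $stream = My::RecordStream->new('seq1', 'seq2', 'seq3');
my $fh = $stream->fh;

my $first = <$fh>;          # calls READLINE in scalar context
print $fh $first;           # calls PRINT
my @rest = <$fh>;           # calls READLINE in list context

print "first: $first, rest: @rest, written: @{ $stream->{out} }\n";
```

The caller only sees ordinary filehandle syntax; which records come back, and where printed values go, is decided entirely by the class.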


Appendix A. Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() so that a missing option does not warn
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
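The gap-trimming logic itself does not depend on bioperl and can be checked on plain strings. In this sketch the three alignment rows are made-up data, and '-' is assumed to be the gap character.

```perl
use strict;
use warnings;

# Toy alignment rows of equal length (made-up data).
my @rows = (
    '--ATGCCTA-',
    'GGATG-CTAC',
    '-GATGCCT--',
);
my $gap = '-';

# Longest run of leading gaps and longest run of trailing gaps over all rows.
my ($startgaps, $endgaps) = (0, 0);
foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(\Q$gap\E*)/;
    my ($trail) = $row =~ /(\Q$gap\E*)$/;
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns in which no row starts or ends with a gap.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the second row, are kept; only the ragged ends are removed, which is what the slice call does on the alignment object.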

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                    '-data' => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
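The "first hit per databank" bookkeeping can be tested on a plain list of hit names, without a Blast report. The names below are made up, and the databank|accession|identifier layout is an assumption about how the hits are named.

```perl
use strict;
use warnings;

# Made-up hit names, listed from best to worst.
my @hit_names = (
    'sp|P00001|AAAA_ECOLI',
    'sp|P00002|BBBB_ECOLI',
    'pir|A00003|CCCC',
    'sp|P00004|DDDD_HUMAN',
    'pir|A00005|EEEE',
);

# Returns the databank prefix of a hit name, or undef when there is none.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first (best) hit seen for each databank.
my (%done, @best);
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next if !defined($db) || $done{$db}++;
    push @best, $name;
}

print "$_\n" for @best;
```

Because Blast lists hits from best to worst, remembering which databanks have already been seen is enough to keep the best hit of each.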

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
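The padding done by fill_gaps can be checked on its own. The sketch below is a variant written with the x repetition operator instead of loops; it is meant to give the same result as the solution's fill_gaps for a segment that fits inside the alignment, and the test values are made up.

```perl
use strict;
use warnings;

# Pads $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;        # segment reaches past the last column
    return ('-' x $lead) . $seq . ('-' x $trail);
}

my $padded = fill_gaps('ATG', 3, 8);
my $full   = fill_gaps('ATGC', 1, 4);

print "$padded\n";
print "$full\n";
```

Every padded sequence has the width of the contig, which is what allows it to be added as a row of the same alignment.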

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 12: Bio Perl

Chapter 2. Sequences

Exercise 2.3 Display a sequence in fasta format

Display the BACR_HALHA entry in fasta format (Solution A.3).

2.2.2 Building mechanisms summary

de novo

    use Bio::Seq;

    my $seq1 = Bio::Seq->new ( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                               -id   => 'my seq',
                               -desc => 'this is a new Seq');

from a file

    use Bio::SeqIO;

    my $seqin = Bio::SeqIO->new ( -file => 'seq.fasta',
                                  -format => 'fasta');
    my $seq3 = $seqin->next_seq();

    my $seqin2 = Bio::SeqIO->newFh ( -file => 'seq.fasta',
                                     -format => 'fasta');
    my $seq4 = <$seqin2>;

    my $seqin3 = Bio::SeqIO->newFh ( -file => 'golden sp:TAUD_ECOLI |',
                                     -format => 'swiss');
    my $seq5 = <$seqin3>;

by fetching an entry from a remote database

    use Bio::DB::SwissProt;

    my $banque = new Bio::DB::SwissProt;
    my $seq6 = $banque->get_Seq_by_id('KPY1_ECOLI');

by fetching an entry from a bioperl indexed local database

    use Bio::Index::Swissprot;
    my $inx = Bio::Index::Swissprot->new( -filename => 'small_swiss.inx');
    my $seq7 = $inx->fetch('100K_RAT');

data: small_swiss.inx [data/small_swiss.inx]

223 A deeper insight into the BioSeq class

Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components

Figure 21 BioSeq class structure

6

Chapter 2 Sequences

7

Chapter 2 Sequences

Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents

8

Chapter 2 Sequences

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure and, if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

In bioperl:

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }
    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4 Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5 Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6 extractcds

code [lecture_codeextractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation of features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7 extractcds, version 2

Compare Exercise 2.6 with this version [lecture_codeextractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6 Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use (fuzzy locations, ranges, lists) to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.

• '<' BEFORE

• '>' AFTER
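
The coding above can be tried out without bioperl. The following is a minimal sketch in plain Perl, assuming position strings written as in the examples of this section; the classify_position function is hypothetical and is not part of Bio::Location::Fuzzy.

```perl
use strict;
use warnings;

# Classify one position string using the codes listed above.
# Returns (type, min, max); min or max is undef when that side is open.
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  undef, $1) if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1, undef) if $pos =~ /^>(\d+)$/;
    return ('BETWEEN', $1, $2)    if $pos =~ /^(\d+)\^(\d+)$/;
    return ('WITHIN',  $1, $2)    if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    return ('EXACT',   $1, $1)    if $pos =~ /^(\d+)$/;
    die "unrecognised position: $pos\n";
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    my ($type, $min, $max) = classify_position($pos);
    printf "%-10s %-8s min=%s max=%s\n", $pos, $type,
        defined $min ? $min : '-', defined $max ? $max : '-';
}
```

Each endpoint of a location can be classified independently, which is why a start such as '<30' can be combined with an exact end such as 90.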

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy
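
To see what a coordinate policy decides, here is a small plain-Perl sketch. It assumes, from the class names only, that the widest policy takes the smallest possible start and the largest possible end, and that the narrowest policy does the opposite; the resolve function and the min/max hash layout are hypothetical and should be checked against the Bio::Location::CoordinatePolicyI documentation.

```perl
use strict;
use warnings;

# A fuzzy endpoint is described here by its smallest and largest possible
# coordinate, e.g. (102.110) gives min 102 and max 110.
# This layout is an assumption for the example, not bioperl's own.
sub resolve {
    my ($policy, $start, $end) = @_;
    return ($start->{min}, $end->{max}) if $policy eq 'widest';
    return ($start->{max}, $end->{min}) if $policy eq 'narrowest';
    die "unknown policy: $policy\n";
}

my $fuzzy_start = { min => 102, max => 110 };   # (102.110)
my $fuzzy_end   = { min => 200, max => 210 };   # (200.210)
printf "widest:    %d..%d\n", resolve('widest',    $fuzzy_start, $fuzzy_end);
printf "narrowest: %d..%d\n", resolve('narrowest', $fuzzy_start, $fuzzy_end);
```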

Figure 2.7 Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8 Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);
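
To make clear what these counts amount to, here is a plain-Perl sketch on a sequence string. The three functions are written for this example only and merely share their names with the bioperl methods. In particular, this sketch counts overlapping words; whether Bio::Tools::SeqWords uses overlapping or non-overlapping windows should be checked in its documentation.

```perl
use strict;
use warnings;

# Count each residue of a sequence string.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Count every overlapping word of a given length.
sub count_words {
    my ($seq, $len) = @_;
    my %count;
    for my $i (0 .. length($seq) - $len) {
        $count{ uc substr($seq, $i, $len) }++;
    }
    return \%count;
}

# 1-based, inclusive sub-sequence, as in the subseq(10,40) call above.
sub subseq {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna  = 'ATGAGTAGTAGTAAAGGTA';
my $mono = count_monomers($dna);
print join(' ', map { "$_=$mono->{$_}" } sort keys %$mono), "\n";
my $words = count_words($dna, 3);
print "AGT occurs $words->{AGT} times\n";
print "first codon: ", subseq($dna, 1, 3), "\n";
```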

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme( -NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                              -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                   '-threshold' => '3.5',
                                                   '-desc'      => 'test sigcleave protein seq',
                                                   '-type'      => 'AMINO');
    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows.

Example 3.1 Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::AlignIO->newFh ( -fh => \*STDIN, -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure

Warning: some alignment formats are available only as input formats.

Figure 3.1 AlignIO Classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2 Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2 Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3 Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all columns as strings
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap
    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings
    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
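
The transpose, filter, transpose-back logic of Example 3.3 can be tested on plain strings, without an alignment file. This is a sketch under that assumption: remove_gap_columns is a hypothetical helper working on a list of equal-length aligned strings, not a SimpleAlign method.

```perl
use strict;
use warnings;

# Remove every alignment column that contains the gap character.
# Takes the gap character and a list of equal-length aligned strings.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # transpose rows into columns
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ for split //, $row;
    }

    # keep only the gap-free columns
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # transpose back into rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        $out[$rownr++] .= $_ for split //, $col;
    }
    return @out;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @filtered;
```

The rows must be given in the same order when reading and when writing back, which is why Example 3.3 calls each_alphabetically() in both loops.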

Exercise 3.1 Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [httpbiowebseqanalinterfacesprotal2dnahtml]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1 StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1 Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2 Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3 Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4 Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

Example 4.2 StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5 Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6 Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report obtained with Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3 Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7 Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8 Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1 Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9 Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10 Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
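
The hash trick suggested in Exercise 4.11 can be tried on a plain list of hit names before plugging it into a Blast parser. The sketch below assumes NCBI-style names of the form database|accession|name and assumes that hits arrive best first; the dbname helper shown here is a stand-in written for this example, not the course's dbname.pl, and the hit names are made up.

```perl
use strict;
use warnings;

# Stand-in helper: take the database tag in front of the first '|'
# of a hit name such as 'sp|P00001|AAAA_ECOLI'.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Keep the first hit seen for each database.
sub best_hit_by_database {
    my (@hit_names) = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};     # this database already has its best hit
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

# made-up hit names, for illustration only
my @hits = ('sp|P00001|AAAA_ECOLI', 'sp|P00002|BBBB_ECOLI',
            'pir|A00001|A00001',    'sp|P00003|CCCC_ECOLI',
            'pir|A00002|A00002');
print "$_\n" for best_hit_by_database(@hits);
```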

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15 Locate EST in a genome

Locate an EST in a genome.

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17 Print information from a BPLite report

Take Example 4.4 and print hit names and, for all HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
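
The comparison asked for in Exercise 4.20 is a set difference in both directions. Here is a plain-Perl sketch of that step alone, assuming the hit names of each iteration have already been collected into arrays; compare_iterations is a hypothetical helper and the hit names are made up.

```perl
use strict;
use warnings;

# Compare the hit names of two successive iterations.
# Returns two array references: hits that appeared, hits that disappeared.
sub compare_iterations {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @appeared    = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@appeared, \@disappeared);
}

# made-up hit names per iteration, for illustration only
my @rounds = ( [qw(HIT_A HIT_B HIT_C)], [qw(HIT_A HIT_C HIT_D)] );
my ($new, $gone) = compare_iterations(@rounds);
print "new: @$new\n";
print "disappeared: @$gone\n";
```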

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);
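
The idea behind such an index can be illustrated in plain Perl: remember, for each entry ID, the byte offset where the entry starts, then seek straight to it. This is only a sketch of the principle, assuming SwissProt-like entries that start with an ID line and end with a // line; make_index and fetch below are written for this example and do not reproduce the file format or the implementation of Bio::Index::Swissprot.

```perl
use strict;
use warnings;

# Build an index: entry ID => byte offset of its 'ID' line.
sub make_index {
    my ($fh) = @_;
    my %offset;
    my $pos = tell($fh);
    while (my $line = <$fh>) {
        $offset{$1} = $pos if $line =~ /^ID\s+(\S+)/;
        $pos = tell($fh);
    }
    return \%offset;
}

# Jump straight to an entry and return its text (undef if unknown).
sub fetch {
    my ($fh, $index, $id) = @_;
    return undef unless exists $index->{$id};
    seek($fh, $index->{$id}, 0) or die "seek failed: $!";
    my $entry = '';
    while (my $line = <$fh>) {
        $entry .= $line;
        last if $line =~ m{^//};
    }
    return $entry;
}

# a tiny made-up flat file, held in memory for the example
my $flat = "ID   ENTRY_ONE\nSQ   aaaa\n//\nID   ENTRY_TWO\nSQ   cccc\n//\n";
open(my $fh, '<', \$flat) or die "cannot open in-memory file: $!";
my $index = make_index($fh);
print fetch($fh, $index, 'ENTRY_TWO');
```

Building the index reads the whole file once; each later fetch only reads the requested entry, which is what makes indexed access worthwhile on large databanks.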

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this EMBL formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [dataarabidop-chr5-1quartfasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);   # Bio::Seq.pm objects

  which is equivalent, with a variable, to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
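
The three uses of references listed above (a hash as a parameter, a function as a parameter, an array as an output parameter) can be combined in one small self-contained script. The names select_hits, my_filter and the -scores and -min keys are invented for this sketch; only the calling style mirrors the bioperl examples.

```perl
use strict;
use warnings;

# Takes a hash reference (parameters), a code reference (filter) and an
# array reference used as an output parameter.
sub select_hits {
    my ($params, $filter, $results) = @_;
    foreach my $score (@{ $params->{-scores} }) {
        push @$results, $score if $filter->($score, $params->{-min});
    }
    return scalar @$results;
}

sub my_filter {
    my ($score, $min) = @_;
    return $score >= $min;
}

my %runParam = ( -scores => [ 12, 48, 7, 95 ],   # reference on an anonymous array
                 -min    => 40 );
my @kept;
my $n = select_hits(\%runParam, \&my_filter, \@kept);
print "$n kept: @kept\n";
print ref(\%runParam), ' ', ref(\&my_filter), ' ', ref(\@kept), "\n";
```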

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
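
What "make a tied filehandle" means can be shown with a few lines of plain Perl: a class provides a READLINE method, and reading from the tied handle with the angle operator returns whatever that method returns, here a record instead of a line. RecordStream and new_record_fh are hypothetical names for this sketch, which only imitates the spirit of newFh and is not how Bio::SeqIO is implemented in detail.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical tied-handle class: each read returns the next record of a
# list instead of the next line of a file.
package RecordStream;

sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}

sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{records} };   # undef when exhausted
}

package main;

# Return a glob reference tied to a RecordStream.
sub new_record_fh {
    my (@records) = @_;
    my $fh = Symbol::gensym();
    tie *$fh, 'RecordStream', @records;
    return $fh;
}

my $in = new_record_fh('first record', 'second record');
while (my $record = <$in>) {
    print "read: $record\n";
}
```

The returned value is a reference to a typeglob, so it can be stored in a scalar and passed to functions like any other reference.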

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std."-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
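
The die and eval mechanism can be tried without bioperl. In this minimal sketch, check_sequence and try_check are invented for the example; the error message only imitates the one shown in the stack trace above.

```perl
use strict;
use warnings;

# Raises an exception with die when the argument is not made of letters.
sub check_sequence {
    my ($seq) = @_;
    die "sequence [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return length $seq;
}

# Runs check_sequence inside eval and reports what happened.
sub try_check {
    my ($seq) = @_;
    my $length = eval { check_sequence($seq) };   # note the ; after the block
    return "caught: $@" if $@;
    return "ok: $length\n";
}

print try_check('ATGC');
print try_check('==');
```

After the eval, the special variable $@ holds the message passed to die, or an empty string when nothing went wrong.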

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
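The same idea, reduced to core Perl: the interface class provides a seq method that only raises an exception, so a subclass that forgets to override it fails at the first call. All class names here (My::SeqI, My::GoodSeq, My::BadSeq) are hypothetical.

```perl
use strict;
use warnings;

package My::SeqI;                 # hypothetical interface class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {                         # "abstract" method: must be overridden
    my ($self) = @_;
    die ref($self) . " did not implement seq()\n";
}

package My::GoodSeq;              # implements the interface
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package My::BadSeq;               # forgets to implement seq()
our @ISA = ('My::SeqI');

package main;

my $good = My::GoodSeq->new(seq => 'ACGT');
print $good->seq, "\n";

my $bad = My::BadSeq->new(seq => 'ACGT');
eval { $bad->seq; };
print "error: $@" if $@;
```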


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
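To experiment with an option string without typing command lines, @ARGV can be set locally before calling getopts. The wrapper parse_options below is a hypothetical helper written for this sketch; only getopts itself comes from Getopt::Std.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parses the given argument list instead of the real
# command line and returns the settings as a hash reference.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;               # getopts() reads from @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        files   => [@ARGV],            # whatever remains after the options
    };
}

my $conf = parse_options('-i', 'embl', 'seqs.embl');
print "$conf->{inform} -> $conf->{outform} (@{ $conf->{files} })\n";
```

A letter followed by a colon in the option string takes a value; a letter without a colon is a simple flag.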

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component


• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
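A small self-contained sketch of @ISA, isa and ref; the two classes My::Feature and My::Exon are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

package My::Feature;              # hypothetical base class
sub new { my $class = shift; return bless {}, $class; }

package My::Exon;                 # hypothetical subclass
our @ISA = ('My::Feature');       # inheritance is declared through @ISA

package main;

my $exon = My::Exon->new;         # new() is inherited from My::Feature
print "class of exon: ", ref($exon), "\n";
print "the object is-a My::Feature\n" if $exon->isa('My::Feature');
print "the class is-a My::Feature\n"  if My::Exon->isa('My::Feature');
print "a feature is not an exon\n"    unless My::Feature->new->isa('My::Exon');
```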

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses a BEGIN block to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
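The order of execution can be observed with a few lines of core Perl. The array @order is just a name chosen for this sketch.

```perl
use strict;
use warnings;

our @order;

# Appears first in the file, but runs only at run time.
push @order, 'runtime statement';

# Appears later, but runs as soon as the compiler has parsed it.
BEGIN { push @order, 'BEGIN block'; }

print join(' -> ', @order), "\n";
```

The BEGIN block is recorded first even though it comes second in the source.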

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
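The three Exporter variables can be tried in a single file. My::Tools, revcomp, gc_count and the stats tag are hypothetical names invented for this sketch; because the module is defined inline, its import method is called explicitly instead of through a use statement.

```perl
use strict;
use warnings;

package My::Tools;                    # hypothetical module, defined inline
use Exporter 'import';                # borrow Exporter's import() method
our @EXPORT      = qw(revcomp);       # exported by default
our @EXPORT_OK   = qw(gc_count);      # exported on demand
our %EXPORT_TAGS = ( stats => [qw(gc_count)] );   # a named set of symbols

sub revcomp  { my $s = reverse shift; $s =~ tr/ACGTacgt/TGCAtgca/; return $s; }
sub gc_count { my $s = shift; return scalar( $s =~ tr/GCgc// ); }

package main;

# In a separate file this would read: use My::Tools qw(:DEFAULT :stats);
My::Tools->import(qw(:DEFAULT :stats));

print revcomp('AACG'), "\n";
print gc_count('AACG'), "\n";
```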

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:


    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
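The mechanism can be reproduced in miniature with core Perl. My::RecordIO, next_rec and write_rec are hypothetical stand-ins for a SeqIO-like class and its reading and writing methods; the sketch is not the bioperl implementation.

```perl
use strict;
use warnings;
use Symbol;

package My::RecordIO;                 # hypothetical stand-in for a SeqIO-like class

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my $self = shift; return shift @{ $self->{in} }; }
sub write_rec { my $self = shift; push @{ $self->{out} }, @_; return 1; }

# Returns an anonymous glob tied to this class.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# Methods that perl calls when the tied handle is created, read or written.
sub TIEHANDLE { my ($class, $obj) = @_; return bless { io => $obj }, $class; }
sub READLINE  { my $self = shift; return $self->{io}->next_rec; }
sub PRINT     { my $self = shift; return $self->{io}->write_rec(@_); }

package main;

my $io = My::RecordIO->new('rec1', 'rec2');
my $fh = $io->fh;
my $first = <$fh>;                    # calls READLINE
print $fh "copy of $first";           # calls PRINT
print "read: $first, written: @{ $io->{out} }\n";
```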


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1


Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file


    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # build the output name from the input name:
        # replace the extension, if any, by the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
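The column arithmetic of this solution can be checked on plain strings, without bioperl. In the sketch below, ungapped_block is a hypothetical helper and the three rows are made-up data; it returns the first and last column (counted from 1) of the block that is free of leading and trailing gaps in every row.

```perl
use strict;
use warnings;

# Hypothetical helper: finds the longest run of leading gaps and the longest
# run of trailing gaps over all rows, and returns the 1-based (start, end)
# columns of the block in between.
sub ungapped_block {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC', 'TTACGT--', '-TACGTA-');
my ($from, $to) = ungapped_block('-', @rows);
print "columns $from to $to\n";
print substr($_, $from - 1, $to - $from + 1), "\n" foreach @rows;
```

The two returned values play the role of the arguments given to slice in the solution.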

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                        '-prog'     => 'blastp',
                        '-data'     => 'swissprot',
                        _READMETHOD => "Blast" );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
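The padding step can be tried on its own. The version below is a sketch that replaces the two loops of the solution by the repetition operator x; the guard against negative counts is an addition made for this example, and the input values are made up.

```perl
use strict;
use warnings;

# Pads $seq with gaps so that it starts at column $start (counted from 1)
# of a row that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # sequence runs past the end of the row
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 4, 10);
print "$padded\n";
print length($padded), "\n";
```

Every padded row then has the same width as the contig, which is what lets the rows be added to one alignment.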

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug!!
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end   - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }
        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;   # a module file must return a true value


Chapter 2 Sequences

data small_swissinx [datasmall_swissinx]

2.2.3 A deeper insight into the Bio::Seq class

Figure 2.1 shows the class structure of the Bio::Seq class. The Bio::Annotation package and the Bio::SeqFeature package are shown in more detail in Figure 2.3 and Figure 2.5. Figure 2.2 shows the relation between a SwissProt entry and the corresponding Bio::Seq components.

Figure 2.1. Bio::Seq class structure


Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components


Figure 2.3. Bio::Annotation package structure

Example 2.3. Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

In bioperl:

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }
    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within the comments, if available (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary solution [codesuseSeq10pl] of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature


Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing the output of programs. For instance (see Exercise 4.16), a Blast HSP is actually an instance of the class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which is in turn a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

12

Chapter 2 Sequences

Figure 25 Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds])

Fetch an EMBL or Genbank entry and try this code, for instance with gb:AB042770 (seq2gb [dataseq2gb]).

Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]

Tip

Have a look at the Bio::SeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_codeextractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation.
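The split between a primary tag and its secondary tags can be sketched in plain perl, without bioperl. The parse_feature helper below is hypothetical (it is not a bioperl function) and assumes the usual feature-table layout: the feature key and its location on the first line, then one /qualifier=value per line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): split a feature-table block
# into its primary tag (the feature key), its location and its qualifiers
# (the secondary tags).
sub parse_feature {
    my ($text) = @_;
    my ($primary_tag, $location, %tags);
    foreach my $line (split /\n/, $text) {
        if ($line =~ m{^\s*/(\w+)=(.*)$}) {
            # a qualifier line: /name=value (a tag may occur several times)
            push @{ $tags{$1} }, $2;
        }
        elsif ($line =~ /^(\S+)\s+(\S+)/) {
            # the first line: feature key followed by its location
            ($primary_tag, $location) = ($1, $2);
        }
    }
    return ($primary_tag, $location, \%tags);
}

my $feature = <<'END';
CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)
END

my ($primary_tag, $location, $tags) = parse_feature($feature);
print "primary tag: $primary_tag ($location)\n";
foreach my $tag (sort keys %$tags) {
    print "secondary tag: $tag = $_\n" foreach @{ $tags->{$tag} };
}
```

Storing the values of each secondary tag in an array keeps repeated qualifiers, such as the two codon lines here, instead of overwriting them.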

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated to features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, lists to specify the begin or the end (see the documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation]). Example taken from Bio::Location::Fuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.

• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102..110 range, inclusive.

• '<' BEFORE

• '>' AFTER
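As an illustration of the codings listed above, here is a minimal plain-perl sketch that names the coding used by a location string. The location_coding function is hypothetical (it is not a bioperl method), and the '..' and '.' notations are assumed to follow the EMBL feature table conventions.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl function): name the coding used by a
# location string, following the list above.
sub location_coding {
    my ($loc) = @_;
    return 'BEFORE'  if $loc =~ /^<\d+$/;           # <30
    return 'AFTER'   if $loc =~ /^>\d+$/;           # >90
    return 'BETWEEN' if $loc =~ /^\d+\^\d+$/;       # 123^124
    return 'WITHIN'  if $loc =~ /^\(\d+\.\d+\)$/;   # (102.110)
    return 'EXACT'   if $loc =~ /^\d+(\.\.\d+)?$/;  # 90 or 30..90
    return 'UNKNOWN';
}

my @examples = ('<30', '>90', '123^124', '(102.110)', '90', '30..90');
my %coding   = map { $_ => location_coding($_) } @examples;
printf "%-10s %s\n", $_, $coding{$_} foreach @examples;
```

In bioperl itself this bookkeeping is done by the location classes; the sketch only shows how the notations differ.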

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codesfeature_graphpl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])

$seqobj->seq                 # sequence as string
$seqobj->subseq(10,40)       # sub-sequence as string
$seqobj->trunc(10,100)       # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate

Bio::Tools::SeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AA...TAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme(
           -NAME => 'EcoRV--GAT^ATC',    # creates a new one
           -MAKE => 'custom');

Bio::Tools::Sigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave(
    '-file'      => 'sigtest.aa',    # not a Bio::Seq object
    '-threshold' => '3.5',
    '-desc'      => 'test sigcleave protein seq',
    '-type'      => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [datainfilealn])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also integrates some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all the
# columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [datainfilealn]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_codeprotal2dnahtml] (download)

• bioperl classes:

    • Bio::AlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]

    • Bio::SimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]

    • Bio::Tools::CodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]

• Web form [httpbiowebseqanalinterfacesprotal2dnahtml]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in  = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query   = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data1protfasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2). Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data1protfasta]). Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out). Solution A.9

Exercise 4.4. Running a remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI. Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml]):

    • print its name and database

    • for all HSPs (class Bio::Search::HSP::GenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml]):

        • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in  = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
my $query   = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                     'database'  => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length. Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.) Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class, so you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input. Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display the hits whose length is greater than a given value (create a blast report from this file [datablastout]). Solution A.14

Exercise 4.10. Filtering hits by position

Only display the hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout]).

• If necessary, you can find here a dbname(hit->name) [solutiondbnamepl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
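The hash idiom suggested in the hint can be sketched on plain data, without bioperl. The hit list below is hypothetical and stands for hits that have already been extracted from a report in best-first order, so the first hit seen for a database is its best one.

```perl
use strict;
use warnings;

# Hypothetical hits, sorted from best to worst as in a Blast report:
# [ database, hit name, E-value ]
my @hits = (
    [ 'sp',  'hit_1', 1e-50 ],
    [ 'pdb', 'hit_2', 1e-40 ],
    [ 'sp',  'hit_3', 1e-30 ],
    [ 'pdb', 'hit_4', 1e-20 ],
    [ 'gb',  'hit_5', 1e-10 ],
);

my %done = ();   # databases whose best hit has already been printed
my @best = ();   # names of the hits that were kept
foreach my $hit (@hits) {
    my ($database, $name, $evalue) = @$hit;
    next if $done{$database};    # a better hit was already seen
    $done{$database} = 1;
    push @best, $name;
    print "$database\t$name\t$evalue\n";
}
```

With a real report, the loop body would stay the same and the database name would come from the hit name.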

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [dataP42704fasta]). Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from the HSPs with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from the hits whose name contains a given pattern, and print these alignments in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [dataest] in this contig [datacontig] of a genome

• you may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [datablast-estout]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line, and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family of parsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml], Bio::Tools::BPpsilite [httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml], Bio::Tools::BPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the Bio::Tools::BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class. That is why the methods used to get information are different, namely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP [httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class Bio::Tools::BPlite::Sbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or the tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [datapsiblastout]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->qs,
                       -id    => $query1->display_id,
                       -start => 1,
                       -end   => $query1->length
                   );
    my $sbjctSeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->ss,
                       -id    => $report->sbjctName,
                       -start => 1,
                       -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format. Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data: Genscan output file [datagenscanout]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [datagenscan-cdsout])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [datasmall_swissdat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [datagenscan-arabout]).

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this EMBL formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [dataarabidop-chr5-1quartfasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = {a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
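The three uses of references listed in this section (a hash as a parameter, a function as a parameter, an array as an output parameter) can be combined in one small, self-contained sketch. The run_search function and its parameter names are hypothetical; they only mimic the calling style of the bioperl constructors shown above.

```perl
use strict;
use warnings;

# Hypothetical function: takes named parameters, among them a reference to
# a hash (-run), a reference to a function (-filt_func) and a reference to
# an array used as an output parameter (-save_array).
sub run_search {
    my (%param) = @_;
    my $run    = $param{-run};          # hash reference
    my $filter = $param{-filt_func};    # code reference
    my $saved  = $param{-save_array};   # array reference (output)
    foreach my $seq (@{ $run->{-seqs} }) {
        push @$saved, $seq if $filter->($seq);
    }
    return scalar @$saved;
}

# keep only the sequences of at least 4 characters
sub my_filter { my ($seq) = @_; return length($seq) >= 4 }

my @seqs     = ('ACGT', 'AC', 'ACGTACGT');
my %runParam = ( -prog => 'blastp', -seqs => \@seqs );
my @kept     = ();

my $count = run_search( -run        => \%runParam,
                        -filt_func  => \&my_filter,
                        -save_array => \@kept );
print "kept $count sequences: @kept\n";
```

Because @kept is passed by reference, the function fills the caller's own array instead of a copy.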

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default to the STDIN and STDOUT descriptors. The filehandle is associated to the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;              # read a line
print OUTFILE $sequence;       # write a sequence (as a string)

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
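As a minimal sketch of passing a filehandle as a parameter, the hypothetical write_line function below receives either a reference to the STDOUT typeglob or a lexical filehandle. The in-memory file is only used here so that the result can be inspected; it assumes a perl with PerlIO (5.8 or later).

```perl
use strict;
use warnings;

# Hypothetical function: writes to whatever filehandle it is given.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

write_line(\*STDOUT, "to standard output");    # reference to a typeglob

# the same function also accepts a lexical filehandle, here an in-memory file
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_line($out, "to a string");
close($out);
print "buffer contains: $buffer";
```

A bare STDOUT could not be stored in $fh; the \*STDOUT form is what makes the handle an ordinary scalar value.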

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class:

sub throw {
    my ($self,$string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n".
              "MSG: ".$string."\n".$std.
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
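The abstract-method idiom and its handling with eval can be tried without bioperl. In the following sketch all class names (ShapeI, Square, Blob) are hypothetical, and a plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

package ShapeI;      # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless { %args }, $class }
sub area {
    # "abstract" method: it only raises an exception
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide an area method\n";
}

package Square;      # implements the interface
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;        # forgets to implement area
our @ISA = ('ShapeI');

package main;

my @messages;
foreach my $shape (Square->new(side => 3), Blob->new()) {
    my $area = eval { $shape->area() };
    if ($@) {
        push @messages, "Caught exception: $@";
    } else {
        push @messages, "area: $area\n";
    }
}
print @messages;
```

The call on the Blob object falls back to the method inherited from the interface class, which is exactly where the exception is raised.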

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) { $format = $opt_f; }
else { $format = "Genbank"; }
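To see what getopts does with a command line, the first example can be run on a simulated @ARGV. The option values and entry names below are made up for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# simulate the command line:  script.pl -i embl -O out.fasta entry1 entry2
@ARGV = ('-i', 'embl', '-O', 'out.fasta', 'entry1', 'entry2');

my %opts = ();
# a letter followed by ":" takes a value; getopts returns false on error
getopts('i:o:I:O:', \%opts) || die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file] ids\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "input:  $infile ($inform)\n";
print "output: $outfile ($outform)\n";
print "remaining arguments: @ARGV\n";
```

The options are removed from @ARGV as they are parsed, so only the non-option arguments remain afterwards.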

6.2.5. Classes

• Inheritance: example in Bio::Seq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh], from_searchResult [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
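The inheritance, isa and ref reminders above can be checked on two tiny classes. Feature and Exon are hypothetical names defined inline for the illustration; they are not the bioperl classes.

```perl
use strict;
use warnings;

package Feature;                 # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package Exon;                    # hypothetical subclass
our @ISA = qw(Feature);          # Exon inherits new from Feature

package main;

my $exon = Exon->new();
print "class of exon: ", ref($exon), "\n";
print "an Exon object is a Feature\n" if $exon->isa('Feature');
print "the Exon class is a Feature\n" if Exon->isa('Feature');
print "a Feature is not an Exon\n"    unless Feature->isa('Exon');
```

Note that isa can be called on an object as well as on a class name, while ref only makes sense on an object.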

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
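The timing of a BEGIN block can be observed with a small trace. This is a sketch in plain Perl: the package My::Location, its policy function and the string 'widest' are hypothetical and only mimic the shape of the bioperl example.

```perl
use strict;
use warnings;

our @trace;

# Hypothetical package whose default is set in a BEGIN block.
package My::Location;
our $coord_policy;
BEGIN {
    $coord_policy = 'widest';                 # runs while the file is being compiled
    push @main::trace, 'BEGIN block';
}
sub policy { return $coord_policy; }

package main;

push @trace, 'first run-time statement';
print join( ', ', @trace ), "\n";
print "default policy: ", My::Location::policy(), "\n";
```

The BEGIN entry is recorded first because the block runs at compile time, before any ordinary statement of the file.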

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles module variables scope (even though there are no private variables in perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = ( tag => [...] ): tags for symbols sets; example:

    use Bio::Tools::Blast qw(:obj);

imports symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods.
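The three Exporter variables can be exercised in a single file. In this sketch the module My::Blast and everything it exports are hypothetical; the module is defined inline and registered in %INC so that use does not look for a file on disk.

```perl
use strict;
use warnings;

# Hypothetical module defined inline; in practice it would live in My/Blast.pm.
BEGIN {
    package My::Blast;
    use Exporter ();
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(blast_version);            # exported by default
    our @EXPORT_OK   = qw($Blast parse_report);      # imported on demand
    our %EXPORT_TAGS = ( obj => [qw($Blast)] );      # tag naming a set of symbols

    our $Blast = { name => 'static object' };
    sub blast_version { return '0.1' }
    sub parse_report  { return 'parsed' }

    $INC{'My/Blast.pm'} = 1;    # tell "use" that the module is already loaded
}

use My::Blast qw(:obj);          # imports only the symbols tagged "obj"

print "imported object: $Blast->{name}\n";
print "blast_version imported? ", ( defined &main::blast_version ? "yes" : "no" ), "\n";
```

Because an explicit import list is given, the default @EXPORT list is not applied, so blast_version stays reachable only under its full name My::Blast::blast_version.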

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
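The effect of these pragmas can be checked by compiling small fragments and looking at the error message. This is a sketch: strict_error is a hypothetical helper, and the variable names passed to use vars are only illustrative.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);   # pre-declared package globals (illustrative names)

%valid_type = map { $_ => 1 } qw(dna rna protein);   # allowed: declared by "use vars"

# Evaluate a code fragment under "use strict"; return the error message, or '' if none.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

my %errors = (
    'undeclared variable' => strict_error('$undeclared = 1'),
    'symbolic reference'  => strict_error('my $name = "x"; $$name = 1'),
    'bareword'            => strict_error('my $x = foo'),
    'declared with my'    => strict_error('my $declared = 1'),
);

foreach my $case ( sort keys %errors ) {
    printf "%-20s %s\n", $case, $errors{$case} ? 'rejected' : 'accepted';
}
```

Only the fragment that declares its variable with my is accepted; the other three trigger the vars, refs and subs parts of use strict respectively.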

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class has thus to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
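The same mechanism can be tried without bioperl. The following sketch uses a hypothetical My::Stream class that keeps its records in memory; unlike Bio::SeqIO, its TIEHANDLE simply returns the existing object, which is an assumption made to keep the example short.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical stream class: records are kept in memory instead of a file.
package My::Stream;
sub new { my ($class, @records) = @_; return bless { in => [@records], out => [] }, $class; }
sub next_rec  { my ($self) = @_; return shift @{ $self->{in} }; }
sub write_rec { my ($self, @recs) = @_; push @{ $self->{out} }, @recs; return 1; }

# Return a filehandle tied to this object.
sub fh {
    my ($self) = @_;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# tie() calls TIEHANDLE; here the handle simply wraps the existing object.
sub TIEHANDLE { my ($class, $obj) = @_; return $obj; }

# <$fh> calls READLINE: one record in scalar context, all remaining ones in list context.
sub READLINE {
    my ($self) = @_;
    return $self->next_rec() unless wantarray;
    my (@list, $rec);
    push @list, $rec while defined( $rec = $self->next_rec() );
    return @list;
}

# print $fh ... calls PRINT.
sub PRINT { my ($self, @recs) = @_; return $self->write_rec(@recs); }

package main;

my $stream = My::Stream->new( 'seq1', 'seq2', 'seq3' );
my $fh     = $stream->fh;

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT
print STDOUT "read first: $first; rest: @rest; written: @{ $stream->{out} }\n";
```

The wantarray test is what lets the same READLINE serve both a single read and a read of everything that is left.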


Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0; # false
        }
        return 1; # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
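The column arithmetic of this solution can be checked on plain strings, without bioperl. In this sketch ungapped_range is a hypothetical helper and the three aligned rows are made-up data; unlike the solution above, it uses the same gap character for the leading and the trailing test.

```perl
use strict;
use warnings;

# Given aligned strings of equal length, return the 1-based first and last
# columns to keep so that no row starts or ends with a gap.
sub ungapped_range {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;    # run of gaps at the start
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;    # run of gaps at the end
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ( $startgaps + 1, length( $rows[0] ) - $endgaps );
}

my @rows = ( '--ACGTAC-', 'TTACGTACG', '-TACG-AC-' );
my ($from, $to) = ungapped_range( '-', @rows );
print "keep columns $from..$to\n";
print substr( $_, $from - 1, $to - $from + 1 ), "\n" for @rows;
```

Internal gaps are kept; only the leading and trailing blocks in which at least one row has a gap are cut.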

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use this instead of a)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix (the word before the first "|") from a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag: ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry:
        # subsequence of the sequence submitted to genscan +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;


Page 14: Bio Perl

Chapter 2 Sequences

Figure 2.2 Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

In bioperl:

    # PDB structures entries
    my @structures = ();
    foreach my $link ( $annotation->get_Annotations('dblink') ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }
    print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available). (Solution A.4)

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry. (Solution A.5)

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4 Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

They are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing programs output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5 Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6 extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code, with for instance gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation of features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7 extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (feature key) is a primary tag and codon a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6 Correspondence between an EMBL entry and bioperl tags

234 Location

Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment

Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation

15

Chapter 2 Sequences

[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]

my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)

which specifies a location whose begin is before 30 and whose end is at 90 Coding

bull rsquorsquo EXACT

bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177

bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive

bull rsquoltrsquo BEFORE

bull rsquogtrsquo AFTER
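As an illustration of the codes listed above, here is a minimal plain-Perl sketch that classifies a single position string. It is not the parser used by Bio::Location::Fuzzy: the function name classify_position and its return convention (a type name followed by the coordinates found) are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): classify one position string
# according to the codes listed above and return (type, coordinates...).
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  $1)     if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1)     if $pos =~ /^>(\d+)$/;
    return ('BETWEEN', $1, $2) if $pos =~ /^(\d+)\^(\d+)$/;
    return ('WITHIN',  $1, $2) if $pos =~ /^\((\d+)\.(\d+)\)$/;
    return ('EXACT',   $1)     if $pos =~ /^(\d+)$/;
    die "unrecognized position: $pos\n";
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    my ($type, @coords) = classify_position($pos);
    printf "%-10s %-8s %s\n", $pos, $type, join(',', @coords);
}
```

In bioperl itself, this work is done by the location classes listed in the next paragraph; the sketch only shows how each notation maps to a type name.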

Location classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for determining the start and end of a fuzzy location:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location classes structure

2.3.5. Graphical view of features

Version 1.0 of bioperl includes a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations, one line at a time, with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles for displaying annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the domain arrows are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq();             # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate();

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat     = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods that alignment classes must implement. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments with user-supplied functions, as the UnivAln class of the old bioperl release could. To filter columns by their properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters out gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split(//, $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split //, $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
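The column logic of this example can be tried without bioperl. The following sketch applies the same three steps (transpose rows into columns, drop the columns containing a gap, transpose back) to plain strings. The toy alignment and the variable names are assumptions made for illustration; a real script would read the sequences from a SimpleAlign object as above.

```perl
use strict;
use warnings;

# Toy alignment: three aligned strings of equal length ('-' is the gap character).
my @rows    = ('AC-GT', 'ACTG-', 'ACTGT');
my $gapchar = '-';

# 1. transpose the rows into column strings
my @cols;
for my $row (@rows) {
    my $i = 0;
    $cols[$i++] .= $_ for split //, $row;
}

# 2. keep only the columns that contain no gap character
my @kept = grep { index($_, $gapchar) < 0 } @cols;

# 3. transpose the kept columns back into rows
my @filtered = ('') x @rows;
for my $col (@kept) {
    my $i = 0;
    $filtered[$i++] .= $_ for split //, $col;
}

print "$_\n" for @filtered;
```

Using index() rather than a regular expression avoids having to quote the gap character, which is what \Q...\E does in the example above.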

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [httpbiowebseqanalinterfacesprotal2dnahtml]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast classes diagram

Figure 4.1. Blast classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
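The hash-marking hint can be sketched in plain Perl as follows. The hit names, their "db|accession|name" layout, and this version of dbname() are hypothetical stand-ins chosen for illustration; they are not taken from the solution file, and the hits are assumed to arrive sorted best-first, as in a Blast report.

```perl
use strict;
use warnings;

# Hypothetical hit names, assumed sorted best-first and shaped like "db|accession|name".
my @hit_names = ('sp|ACC1|HIT_A', 'gb|ACC2|HIT_B', 'sp|ACC3|HIT_C', 'gb|ACC4|HIT_D');

# Hypothetical stand-in for the dbname() helper mentioned above.
sub dbname {
    my ($name) = @_;
    my ($db) = split /\|/, $name;
    return $db;
}

my %done;    # databases for which a best hit has already been kept
my @best;
for my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};    # a better hit was already seen for this database
    $done{$database} = 1;
    push @best, $name;
}

print "$_\n" for @best;
```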

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from the HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line, and record each corresponding best hit as a feature of the query sequence.
Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                         'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                      -format => 'fasta' );

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.
Solution A.26

4.1.6. Blast internal classes structure

The following diagram shows the internal class structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, but they are not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) an ordinary, named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
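To make the first bullet concrete without bioperl, here is a small sketch of a function that receives a hash reference and a code reference. The function select_values and the sample data are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical function: takes a hash reference and a code reference, and
# returns a reference to an anonymous array of the values accepted by the filter.
sub select_values {
    my ($params, $filter) = @_;
    return [ grep { $filter->($_) } map { $params->{$_} } sort keys %$params ];
}

my %score  = (a => 1, b => 20, c => 30);
my $big    = sub { $_[0] > 10 };              # reference to an anonymous function
my $result = select_values(\%score, $big);    # both arguments are passed by reference

print ref(\%score), " ", ref($big), " ", ref($result), "\n";   # HASH CODE ARRAY
print "@$result\n";                                            # 20 30
```

Passing \%score instead of %score keeps the hash as a single argument; without the backslash it would be flattened into a list of keys and values.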

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $line;          # write that line

• In bioperl, there are classes that build filehandles:

    $in  = Bio::SeqIO->newFh( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
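The typeglob point can be checked with a few lines of core Perl. In this sketch the function write_record is hypothetical; the second call uses a lexical filehandle opened on an in-memory buffer, a facility of more recent perl versions than the one this course was written for, shown here only as an alternative.

```perl
use strict;
use warnings;

# Hypothetical function that writes to whatever filehandle it is given.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword filehandle must be passed as a reference to its typeglob.
write_record(\*STDOUT, "to standard output");

# A lexical filehandle is already a scalar and can be passed directly.
open(my $out, '>', \my $buffer) or die "cannot open in-memory file: $!";
write_record($out, "to a lexical filehandle");
close $out;

print $buffer;
```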

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std."-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
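The same die/eval pattern can be run without bioperl. In the sketch below, checked_length is a hypothetical function that raises an exception on malformed input, loosely imitating the message shown at the start of this section.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on bad input.
sub checked_length {
    my ($seq) = @_;
    die "sequence [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return length $seq;
}

for my $seq ('ACGT', '==') {
    my $len = eval { checked_length($seq) };    # $@ is set if the block died
    if ($@) {
        print "Caught exception: $@";
    } else {
        print "no exception, length $len\n";
    }
}
```

The error message ends with a newline so that perl does not append the file name and line number to it.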

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
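A reduced version of this idiom, with hypothetical class names (My::SeqI, My::GoodSeq, My::BadSeq) and a plain die instead of throw, shows what happens when a subclass does or does not provide the method:

```perl
use strict;
use warnings;

package My::SeqI;           # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub seq { die ref($_[0]) . " did not provide a seq method\n" }   # "abstract" method

package My::GoodSeq;        # implements the interface
our @ISA = ('My::SeqI');
sub seq { return $_[0]->{seq} }

package My::BadSeq;         # forgets to implement seq
our @ISA = ('My::SeqI');

package main;

for my $class ('My::GoodSeq', 'My::BadSeq') {
    my $obj = $class->new(seq => 'ACGT');
    my $seq = eval { $obj->seq };
    print $@ ? "error: $@" : "$class: $seq\n";
}
```

The error is only detected at run time, when the missing method is actually called; perl has no compile-time check for abstract methods.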

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
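To see what getopts leaves in the hash and in @ARGV, the sketch below simulates a command line by assigning @ARGV locally. The option letters and the file name are assumptions chosen for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, for illustration only.
local @ARGV = ('-i', 'embl', '-h', 'seqs.embl');

my %opts;
# 'h' is a flag; 'i:' and 'o:' take a value (the colon marks an argument).
getopts('hi:o:', \%opts) or die "usage: $0 [-h] [-i informat] [-o outformat] file\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';

print "help requested\n" if $opts{h};
print "$inform -> $outform, remaining arguments: @ARGV\n";
```

Options are removed from @ARGV as they are parsed, so whatever remains (here the file name) can be handled afterwards.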

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component.

  • Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

  • Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
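The isa and ref tests can be tried on two small hypothetical classes (My::Seq and My::RichSeq are stand-ins invented for this sketch, not bioperl classes):

```perl
use strict;
use warnings;

package My::Seq;            # hypothetical base class
sub new { return bless {}, shift }

package My::RichSeq;        # hypothetical subclass
our @ISA = ('My::Seq');

package main;

my $seq = My::RichSeq->new;
print "class of \$seq: ", ref($seq), "\n";
print "object is-a My::Seq\n" if $seq->isa('My::Seq');
print "class is-a My::Seq\n"  if My::RichSeq->isa('My::Seq');
print "a My::Seq is not a My::RichSeq\n" unless My::Seq->new->isa('My::RichSeq');
```

isa follows the @ISA chain, so it answers true for the object's own class and for every ancestor, whereas ref only reports the class the object was blessed into.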

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
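The three Exporter variables can be observed in a single file. The sketch below defines a hypothetical module My::Tools inline (inside a BEGIN block, with an %INC entry so that use does not look for a file on disk) and then imports one tag from it. All names in it are invented for this illustration.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;                        # hypothetical module, defined inline
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(version_string);    # exported by default
    our @EXPORT_OK   = qw($Tool helper);      # exported on demand
    our %EXPORT_TAGS = (obj => [qw($Tool)]);  # a named set of symbols
    our $Tool        = 'shared tool object';
    sub version_string { return 'My::Tools 0.1' }
    sub helper         { return 'helper called' }
    $INC{'My/Tools.pm'} = 1;                  # tell perl the module is already loaded
}

use My::Tools qw(:obj);    # imports only the symbols tagged "obj"

print "$Tool\n";
print main->can('version_string')
    ? "default export present\n"
    : "default export skipped\n";
```

Because an explicit import list is given, the default exports in @EXPORT are not brought in; only $Tool, the member of the obj tag, becomes visible in the calling package.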

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
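The mechanism can be reproduced in a few lines of core Perl. In this sketch, My::ListIO is a hypothetical class whose "stream" is backed by two arrays; it is not how Bio::SeqIO stores its data, and its TIEHANDLE simply returns the existing object, a simplification chosen for brevity.

```perl
use strict;
use warnings;
use Symbol ();

package My::ListIO;      # hypothetical class: a "stream" backed by two arrays

sub new {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

sub fh {                 # same idea as the fh method shown above
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

sub TIEHANDLE { my ($class, $self) = @_; return $self }
sub READLINE  { my $self = shift; return shift @{ $self->{in} } }
sub PRINT     { my $self = shift; push @{ $self->{out} }, @_; return 1 }

package main;

my $io = My::ListIO->new('seq1', 'seq2');
my $fh = $io->fh;
while (defined(my $item = <$fh>)) {    # calls READLINE
    print {$fh} uc($item);             # calls PRINT
}
print "@{ $io->{out} }\n";
```

Reading with <$fh> and writing with print never touch a real file here: perl routes both operations to the READLINE and PRINT methods of the tied class.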

Appendix A Solutions

A1 Sequences

Solution A1 BioSeqIO structure

Exercise 21

59

Appendix A Solutions

Figure A1 BioSeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
  # no output file given: derive its name from the input file
  $outfile = $infile;
  $outfile =~ s/\..*$// if $outfile =~ /\./;
  $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp
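The default output name can be isolated in a small function. This is a sketch under one assumption: the helper default_outfile is hypothetical, and it strips only the last extension of the input name, a slightly more conservative variant of the substitution used in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace the last extension of $infile by the output format name.
sub default_outfile {
    my ($infile, $outform) = @_;
    my $outfile = $infile;
    $outfile =~ s/\.[^.\/]*$//;      # drop the last extension, if there is one
    return "$outfile.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";
print default_outfile('seqs', 'fasta'), "\n";
```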

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta
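The option handling can be tested on its own, without any sequence file. In this sketch parse_options is a hypothetical wrapper; it localises @ARGV so that getopts can be fed an arbitrary list of arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Parse a list of arguments the way the converter does and
# return the resulting settings as a hash, defaults included.
sub parse_options {
    local @ARGV = @_;                 # getopts always reads @ARGV
    my %opts;
    getopts('i:o:I:O:', \%opts) or die "bad options\n";
    return (
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
    );
}

my %conf = parse_options(qw(-I seqs.sp -o GCG));
print "$_ = $conf{$_}\n" for sort keys %conf;
```

Each letter followed by a colon in the getopts string takes a value; a letter without colon, such as h in the next script, is a simple flag.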

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

# apply the default before lc() to avoid a warning on an undefined option
my $inform = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {

  my $form = shift @_;

  my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                   scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                   phd 1 qual 1);

  if (! exists $formats{$form}) {
    print STDERR "$form -- unknown format\n";
    usage ();
    return 0; # false
  }
  return 1; # true
}

sub usage {

  my $prog = `basename $0`; chomp($prog);

  print "\n";
  print "$prog: convert files from one format into another format\n";
  print "\n";
  print "$prog [-i informat] [-o outformat] < infile\n";
  print "\n";
  print "  -i informat  -- infile format (default 'swiss')\n";
  print "  -o outformat -- outfile format (default 'fasta')\n";
  print "\n";
  print "  -h           -- print this message\n";
  print "\n";
  print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
  print "\n";

  exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
  if ($link->database() eq 'PDB') {
    push (@structures, $link->primary_id());
  }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function -- exercise
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
  my @lines = split (/\n/, $comment->text());
  foreach my $line (@lines) {
    if ( $line =~ s/^-!- FUNCTION: // ) {
      $line =~ s/\n/ /g;
      $function = $line;
      last;
    }
  }
  last if $function ne "";
}
print "\nFunction: $function \n";
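The search for the FUNCTION topic only relies on string handling, so it can be tried on plain strings. In this sketch the helper find_function and the comment texts are invented for the illustration; the layout of the comment text ("-!- TOPIC: text", one topic per line) is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text of the FUNCTION topic found in a list of comment blocks,
# or an empty string when none of them has one.
sub find_function {
    my @comments = @_;
    foreach my $text (@comments) {
        foreach my $line (split /\n/, $text) {
            # the substitution succeeds only on the FUNCTION line and strips its label
            return $line if $line =~ s/^-!- FUNCTION: //;
        }
    }
    return '';
}

# Invented comment texts, for illustration only.
my @comments = (
    "-!- SUBUNIT: Homodimer.",
    "-!- FUNCTION: Catalyzes an example reaction.\n-!- PATHWAY: Example pathway.",
);
print "Function: ", find_function(@comments), "\n";
```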

Solution A.5 Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
  if ($feat->primary_tag() eq 'TRANSMEM') {
    print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
  }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
  $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
  $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
  my $str = $seq->seq();
  $str =~ /^(\Q$gap_char\E*)/;
  my $len = length($1);
  $startgaps = ($startgaps > $len) ? $startgaps : $len;
  $str =~ /(-*)$/;
  $len = length($1);
  $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
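The computation of the two cut positions can be checked on plain strings. This sketch assumes rows of equal length; ungapped_window is a hypothetical helper and the three rows are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given aligned rows of equal length, return the 1-based first and last
# columns to keep so that no row starts or ends with a gap.
sub ungapped_window {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;   # run of gaps at the beginning
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;   # run of gaps at the end
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC-', 'TTACG-AC-', '-TACGTACG');
my ($first, $last) = ungapped_window('-', @rows);
print "keep columns $first..$last\n";
print substr($_, $first - 1, $last - $first + 1), "\n" for @rows;
```

Gaps inside a row (as in the second one) are kept; only the leading and trailing blocks are cut.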

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these three lines instead of the two lines of solution a)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast"
                                                     );
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
  print STDERR "waiting...\n";
  # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
  foreach my $rid ( @rids ) {
    my $rc = $factory->retrieve_blast($rid);
    if( !ref($rc) ) {
      if( $rc < 0 ) {
        # retrieve_blast returns -1 on error
        $factory->remove_rid($rid);
      }
      # retrieve_blast returns 0 on 'job not finished'
      sleep 5;
    } else {
      # ---- Blast done ----
      $factory->remove_rid($rid);
      my $result = $rc->next_result;
      print "database: ", $result->database_name(), "\n";
      while( my $hit = $result->next_hit ) {
        print "hit name is: ", $hit->name, "\n";
        while( my $hsp = $hit->next_hsp ) {
          print "score is: ", $hsp->score, "\n";
        }
      }
    }
  }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit accession: ", $hit->accession(), "\n";
  while( my $hsp = $hit->next_hsp()) {
    print "length: ", $hsp->length(), "\n";
  }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit name: ", $hit->name(), "\n";
  while( my $hsp = $hit->next_hsp()) {
    print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
  }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
# hits at least this long are displayed (default 200)
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  if ($hit->length() >= $max_length) {
    print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
    while( my $hsp = $hit->next_hsp()) {
      print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
  }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;   # -1 means: no upper limit

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print "\thit name: ", $hit->name(), "\n";
  while( my $hsp = $hit->next_hsp()) {
    if ($hsp->start() >= $min_start &&
        ($max_end == -1 || ($hsp->end() <= $max_end))) {
      print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
      print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
  }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# databank prefix of a hit name, e.g. 'sp' for 'sp|...'
sub dbname {
  my ($name) = @_;
  my $db = $name;
  $db =~ /(\w+)\|/;
  $db = $1;
  return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;   # databanks for which the best hit has already been displayed
while( my $hit = $result->next_hit()) {
  my $dbname = dbname($hit->name());
  if ($done{$dbname}) {
    next;
  } else {
    $done{$dbname} = 1;
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
      print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
  }
}
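The selection logic (keep the first hit seen for each databank, relying on the hits being sorted best first) can be tried on a plain list of names. The hit names below are invented, and this variant of dbname returns the whole name when it contains no '|' separator, which is an assumption of the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Databank prefix of a hit name such as 'sp|P11111|A_ECOLI'; the whole
# name is returned when it contains no '|' separator.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

# Hit names in report order (best first); invented for the example.
my @hits = ('sp|P11111|A_ECOLI', 'sp|P22222|B_ECOLI',
            'tr|Q33333|Q33333_BACSU', 'pdb|1ABC|A');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;      # a better hit from this databank was already kept
    push @best, $hit;
}
print "$_\n" for @best;
```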

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
  while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    last;   # only the first hit of each result
  }
}

Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
  while( my $hsp = $hit->next_hsp()) {
    if ($hsp->frac_identical() >= $level) {
      print $hsp->hit_string, "\n";
    }
  }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
  while( my $hsp = $hit->next_hsp()) {
    if ($hsp->frac_identical() >= $level) {
      print $hsp->hit_string, "\n";
    }
  }
}

Solution A.20 Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
                        -seq => $seq->seq,
                        -id => $seq->display_id,
                        -start => 1,
                        -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  my $i = 0;
  while( my $hsp = $hit->next_hsp()) {
    $i++;
    my $start = $hsp->hit->start();
    my $end = $hsp->hit->end();
    my $name = $result->query_name . "_HSP_$i";
    my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
    my $query_seq = Bio::LocatableSeq->new(
                        -seq => $seq,
                        -id => $name,
                        -start => $start,
                        -end => $end);
    $aln->add_seq($query_seq);
  }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
  # add gaps from 0 to start, and from start + seq length to end
  my ($seq, $start, $length) = @_;
  my $gapped_seq = "";
  for (my $i = 0; $i < $start - 1; $i++) {
    $gapped_seq .= "-";
  }
  $gapped_seq .= $seq;
  for (my $i = $start + length($seq); $i <= $length; $i++) {
    $gapped_seq .= "-";
  }
  return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
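The padding done by fill_gaps can also be written with the repetition operator. The following sketch is an alternative formulation, not the course's version; it assumes 1-based start positions and clamps the trailing padding to zero when the sequence runs past the end of the row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of a row that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;                        # columns before the sequence
    my $trail = $length - $lead - length($seq);    # columns after the sequence
    $trail = 0 if $trail < 0;                      # the sequence runs past the row end
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

Every padded row then has the same width as the contig, so it can be added to the alignment as is.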

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
  my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                      program => 'blastp');

  my $report = $factory->blastall($query);

  while(my $sbjct = $report->nextSbjct) {
    print STDERR "name: ", $sbjct->name, "\n";
    while (my $hsp = $sbjct->nextHSP) {
      print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
      $query->add_SeqFeature($hsp);
    }
    last;
  }
}

foreach my $feat ($query->all_SeqFeatures()) {
  print STDERR "Feature from ", $feat->start, " to ", $feat->end,
               " Primary tag: ", $feat->primary_tag,
               ", produced by ", $feat->source_tag(), "\n";
  $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22 Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
  print $subject->name(), "\n";
  while (my $hsp = $subject->nextHSP()) {
    print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
  }
}

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
  print $subject->name(), "\n";
  while (my $hsp = $subject->nextHSP()) {
    print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
  }
}

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
  print $subject->name(), "\n";
  while (my $hsp = $subject->nextHSP()) {
    print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
  }
}

Solution A.25 Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();   # hits seen in the previous rounds

for my $iteration (1 .. $blast_report->number_of_iterations()) {
  print "\n-----------------------------\niteration: $iteration\n";
  my $result = $blast_report->round($iteration);
  my $hitarray_ref = $result->newhits;
  foreach my $hit ( @$hitarray_ref ) {
    print "NEW hit name: $hit\n";
  }
  my $oldhitarray_ref = $result->oldhits;
  foreach my $hit (@previous_hitarray) {
    if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
      next;
    }
    print "DISAPPEARED hit name: $hit\n";
  }
  push (@previous_hitarray, @$oldhitarray_ref );
  push (@previous_hitarray, @$hitarray_ref );
}
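The comparison between two rounds amounts to two set differences, which hashes express directly. This sketch works on plain lists of hit names; compare_rounds is a hypothetical helper and the names are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare the hit names of two successive rounds.
# Returns two array references: the new hits and the hits that disappeared.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@disappeared);
}

# Invented hit names for two rounds.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_A HIT_C HIT_D);
my ($new, $gone) = compare_rounds(\@round1, \@round2);
print "NEW hit name: $_\n"         for @$new;
print "DISAPPEARED hit name: $_\n" for @$gone;
```

A hash lookup tests exact names, whereas the grep with a pattern used in the solution also accepts a name that is contained in a longer one.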

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
  my $aln = Bio::SimpleAlign->new();
  my $querySeq = Bio::LocatableSeq->new(
                      -seq => $hsp->qs,
                      -id => $query1->display_id,
                      -start => 1,
                      -end => $query1->length
                      );
  my $sbjctSeq = Bio::LocatableSeq->new(
                      -seq => $hsp->ss,
                      -id => $report->sbjctName,
                      -start => 1,
                      -end => $query2->length);

  $aln->add_seq($querySeq);
  $aln->add_seq($sbjctSeq);
  print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

  # create entry object as database sequence
  my $entry = Bio::Seq::RichSeq->new();
  $entry->add_date($date);
  $entry->molecule('mRNA');
  $entry->division('PLN');

  # extract id of genscan prediction
  my $cds = $gene->predicted_cds;
  $cds->id() =~ /\|(.+)\|/;
  my $id = $1;   # bug
  $cds->id($id);
  # add cds as sequence to the entry
  $entry->primary_seq($cds);

  # annotation: add all detected exons as SeqFeature to the entry
  foreach my $exon ($gene->exons()) {
    $entry->add_SeqFeature($exon);
  }

  # add all detected UTR's as SeqFeature to the entry
  foreach my $utr ($gene->utrs()) {
    $entry->add_SeqFeature($utr);
  }

  # print the entry to the database flatfile in embl format
  print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
  my $seq = $inx->fetch($id);
  if (defined $seq) {
    print $out $seq;
  } else {
    print STDERR "no entry with id $id in this database\n";
  }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

  # create entry object as database sequence
  my $entry = Bio::Seq::RichSeq->new();
  $entry->add_date($date);
  $entry->molecule('DNA');
  $entry->division('PLN');

  # extract id of genscan prediction
  my $cds = $gene->predicted_cds;
  my @ids = split(/\|/, $cds->id());

  # annotation
  my $start = 0;
  my $end = 0;
  my $loc;
  my $first = 1;
  my $newloc = new Bio::Location::Split();
  my $strand = 1;
  my $nocds = 0;
  foreach my $exon ($gene->exons()) {

    $loc = $exon->location();

    if ($first) {
      $first = 0;
      $start = $loc->start - $delta;
      if ($start < 0) { $start = 0; }
    }

    $loc->start($loc->start - $start);
    $loc->end ($loc->end - $start);

    $exon->add_tag_value('source', 'genscan');
    $exon->primary_tag =~ m/(.*)Exon/;
    $exon->primary_tag('Exon');
    $exon->add_tag_value(lc($1), 1);

    # add all detected exons as SeqFeature to the entry
    $entry->add_SeqFeature($exon);

    # construct location for CDS
    if ($strand == -1 && $loc->strand == 1) {
      $newloc->warn("exons are coded on different strands");
      $nocds = 1;
    }
    if ($loc->strand == -1) {
      $loc->strand(1);
      $newloc->strand(-1);
    }
    $newloc->add_sub_Location($loc);
  }

  $end = $loc->end + $delta;
  if ($end > $seq->length()) { $end = $seq->length(); }

  if (! $nocds) {
    # construct a SeqFeature CDS for the entry (join all exons)
    my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                 -location => $newloc);

    # add the translation to the CDS Feature
    $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

    # add the CDS as Feature to the entry
    $entry->add_SeqFeature($newcds);
  }

  # construct the sequence for the entry: subsequence of the sequence
  # submitted to genscan, +/- delta at each side
  my $seqstr;
  if ( $start > $end ) {
    $seqstr = $seq->subseq($end, $start);
  } else {
    $seqstr = $seq->subseq($start, $end);
  }

  my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                      -id => $ids[1],
                                      -moltype => 'dna');

  $entry->primary_seq($newseq);

  # print the entry to the database flatfile in embl format
  print $banque $entry;
}

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
  # available local databases and the format of their entries
  %AVAIL_LOCALDB = ( sprot => 'swiss',
                     sptrnrdb => 'swiss',
                     gb => 'genbank',
                     embl => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
  my ($self, @args) = @_;

  return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
  my ($self, @args) = @_;

  return $self->get_Seq(@args, '-a');
}

sub get_Seq {

  my ($self, $entry, $access) = @_;

  my ($db, $id) = split(':', $entry);

  if ($access !~ /^-[ia]$/) { $access = ""; }

  if (! $id) { $self->throw ("no database specified"); }

  if (! _has_local_DB($db)) {
    $self->throw ("$db -- no such database available");
  }

  my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                            -format => _get_seq_format($db));

  my $seq = $in->next_seq();

  $in->DESTROY();
  if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

  $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

  return $seq;
}

sub _has_local_DB {

  my ($db) = @_;

  if (exists $AVAIL_LOCALDB{$db}) { return 1; }
  return 0;
}

sub _get_seq_format {

  my ($db) = @_;

  return $AVAIL_LOCALDB{$db};
}

1;
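The checks made on the entry name before golden is called can be exercised without bioperl or golden. In this sketch entry_format is a hypothetical helper, the table is a shortened copy of the one in the module, and plain die replaces the throw method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the database table of the module.
my %AVAIL_LOCALDB = (sprot => 'swiss', gb => 'genbank', embl => 'embl');

# Split a 'db:id' entry name and look up the flat-file format of its database.
# Dies with a message when the entry is malformed or the database is unknown.
sub entry_format {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n"
        unless defined($id) && length($id);
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = entry_format('sprot:TAUD_ECOLI');
print "database=$db id=$id format=$format\n";
```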

Page 15: Bio Perl

Chapter 2 Sequences

Figure 2.2 Relation of a SwissProt entry and the corresponding Bio::Seq object components

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object?

in bioperl:

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
  if ($link->database() eq 'PDB') {
    push (@structures, $link->primary_id());
  }
}
print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available) (Solution A.4)

Go to

Work on Exercise 2.6 before you continue

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry (Solution A.5)

Tip

Use Figure 2.2 again

Use the appropriate object's documentation

Here is a summary [codes/useSeq10.pl] solution of the exercises in this section

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4 Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which again is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5 Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6 extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds] )

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb])

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation of features in an Embl entry and the Bio::SeqFeature class.

Exercise 2.7 extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers])

In the following example, CDS (feature key) is a primary tag, codon a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See methods related to tags in Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation
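The relation between a feature key and its qualifiers can be modelled with a hash of lists. The following is a plain-Perl sketch of the idea, not the way bioperl parses or stores features; parse_feature and the layout of the returned structure are assumptions made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a tiny feature-table fragment into a primary tag (the feature key)
# and a hash of secondary tags, each holding the list of its values.
sub parse_feature {
    my @lines = @_;
    my %feature = (primary_tag => undef, location => undef, tags => {});
    foreach my $line (@lines) {
        if ($line =~ m{^\s*/(\w+)=(.*)}) {           # qualifier line: /tag=value
            push @{ $feature{tags}{$1} }, $2;
        }
        elsif ($line =~ /^(\S+)\s+(\S+)/) {          # key line: KEY location
            @feature{qw(primary_tag location)} = ($1, $2);
        }
    }
    return \%feature;
}

my $feat = parse_feature(
    'CDS             1..1000',
    '                /codon=(seq:"cug",aa:Ser)',
    '                /codon=(seq:"tga",aa:Trp)',
);
print "primary tag: $feat->{primary_tag}\n";
foreach my $tag (sort keys %{ $feat->{tags} }) {
    print "secondary tag '$tag': $_\n" for @{ $feat->{tags}{$tag} };
}
```

A secondary tag keeps a list of values because the same qualifier may be repeated, as codon is here.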

Figure 2.6 Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated to features, sequences belonging to an alignment...

Coordinates types. There are several coordinates types and representations you can use: fuzzy locations, ranges, lists, to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]); example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start => '<30',
                                             -end => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..': EXACT

• '^': BETWEEN (a site between 2 bases); example: 123^124, a site between bases 123 and 124; 145^177, a site between 2 adjacent bases somewhere between bases 145 and 177

• '.': WITHIN (a base within a range); example: (102.110) is a site within the 102..110 range, inclusive

• '<': BEFORE

• '>': AFTER
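The notation can be decoded with a few regular expressions. This is a plain-Perl sketch and not the parser of Bio::Location::Fuzzy; classify_position is a hypothetical helper, and calling a bare number EXACT is an assumption of the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify one position written in feature-table notation.
# Returns (type, min, max); an unknown bound is undef.
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  undef, $1) if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1, undef) if $pos =~ /^>(\d+)$/;
    return ('BETWEEN', $1, $2)    if $pos =~ /^(\d+)\^(\d+)$/;
    return ('WITHIN',  $1, $2)    if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    return ('EXACT',   $1, $1)    if $pos =~ /^(\d+)$/;
    die "unrecognised position '$pos'\n";
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    my ($type, $min, $max) = classify_position($pos);
    printf "%-10s %-8s min=%s max=%s\n", $pos, $type,
        defined($min) ? $min : '?', defined($max) ? $max : '?';
}
```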

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

• interface: Bio::Location::CoordinatePolicyI

• default: Bio::Location::WidestCoordPolicy

• Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7 Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line, using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you also have the possibility to write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8 Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq();             # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate();

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);
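To make clear what kind of result these counting methods give, here is a minimal sketch in plain Perl that works on sequence strings. It is not the bioperl code: the two functions are hypothetical, and this sketch counts overlapping words, which may differ from the convention used by Bio::Tools::SeqWords.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only (not the bioperl code).

# Count each residue of a sequence string; returns a hash reference.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ foreach split //, uc $seq;
    return \%count;
}

# Count every overlapping word of the given length.
sub count_words {
    my ($seq, $word_length) = @_;
    my %count;
    foreach my $i (0 .. length($seq) - $word_length) {
        $count{ uc substr($seq, $i, $word_length) }++;
    }
    return \%count;
}

my $monomers = count_monomers('ACGTACGA');
my $words    = count_words('ACGTACGA', 2);
print "A: $monomers->{A}\n";
print "AC: $words->{AC}\n";
```

Both functions return a reference to a hash keyed by residue or word, so the counts can be looked up with the arrow syntax described in the Perl reminders chapter.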

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat     = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class, so format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments with user-supplied functions, as was possible with the UnivAln class of the old bioperl release. To filter columns by their properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters out gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
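The transpose, filter, transpose-back idea of this example can also be tried without bioperl. The sketch below is a simplification under stated assumptions: the alignment is a plain list of equal-length strings, and the function name remove_gap_columns is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every column containing the gap character
# from a list of equal-length aligned strings.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # transpose the rows into column strings
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ foreach split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # transpose back into rows
    my @new_rows = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        $new_rows[$rownr++] .= $_ foreach split //, $col;
    }
    return @new_rows;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-TGT', 'ACTG-');
print "$_\n" foreach @filtered;
```

In the three sample rows only the first and fourth columns are free of gaps, so each filtered row keeps two residues.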

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2). Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]). Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out). Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI. Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, whose documentation describes the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length. Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.) Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast in the examples above actually belongs to the Bio::SearchIO class, so you can create such an object directly from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input. Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]). Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
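The hash hint of Exercise 4.11 can be tried on plain strings before working with a real report. This is only a sketch, not the course solution: the hit names are made up, they are assumed to be sorted best first, and the database name is assumed to be the part before the first '|'.

```perl
use strict;
use warnings;

# Made-up hit names, assumed to be sorted from best to worst.
my @hit_names = ('sp|P12345|AAA_HUMAN', 'pir|A0001',
                 'sp|P99999|BBB_MOUSE', 'pir|B0002');

# Hypothetical helper: keep the first (best) hit seen for each database.
sub best_hit_by_database {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my ($database) = split /\|/, $name;
        next if $done{$database};      # a better hit was already kept
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

print "$_\n" foreach best_hit_by_database(@hit_names);
```

The %done hash plays the role of a set: a key is present as soon as one hit of that database has been kept.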

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome.

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Run a local Blast against each of the databases given on the command line, and record each corresponding best hit as a feature of the query sequence. Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is still necessary for blasting two sequences and for running a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for accessing local databases. We could use bioperl indexes, but this is not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references to an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

with a named array variable.

• references to anonymous data structures:

    # (reminder) a plain, named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference to an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference to an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the thing a reference points to (or its class, see below).
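To see hash and function references working together outside bioperl, here is a small self-contained sketch. The function select_hits, the filter my_filter and the parameter names -min_score and -hits are all hypothetical; they only mimic the calling style shown above.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash reference and a code reference.
sub select_hits {
    my ($params, $filter) = @_;
    return grep { $filter->($_, $params->{-min_score}) } @{ $params->{-hits} };
}

# Hypothetical filter function, passed by reference below.
sub my_filter {
    my ($hit, $min_score) = @_;
    return $hit->{score} >= $min_score;
}

my %runParam = ( -min_score => 50,
                 -hits      => [ { name => 'h1', score => 80 },
                                 { name => 'h2', score => 20 } ] );

my @kept = select_hits(\%runParam, \&my_filter);
print "kept: $_->{name}\n" foreach @kept;
```

The called function never copies the hash or the filter: it only follows the references with the arrow syntax, which is why large parameter sets can be passed cheaply.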

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a line
    print OUTFILE $sequence;        # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
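Passing a typeglob reference can be tried with a small sketch in plain Perl. The function write_record is hypothetical, and the in-memory file is only used here so that the example does not create a file on disk.

```perl
use strict;
use warnings;

# Hypothetical function that receives a filehandle as a parameter.
sub write_record {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# OUT is a bareword filehandle; here it writes into an in-memory string.
my $buffer = '';
open(OUT, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record(\*OUT, 'first record');     # pass a reference to the typeglob
close(OUT);

write_record(\*STDOUT, $buffer);         # the same call works for STDOUT
```

The called function does not need to know whether it received a reference to OUT or to STDOUT; it simply prints to whatever handle it was given.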

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std.
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
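The same die and eval mechanism can be tried without bioperl. In this minimal sketch the function checked_length is hypothetical; it only imitates a module that refuses a sequence that does not look healthy.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on bad input.
sub checked_length {
    my ($seq) = @_;
    die "sequence [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return length $seq;
}

foreach my $seq ('ACGT', '==') {
    my $length = eval { checked_length($seq) };   # the ; closes the eval statement
    if ($@) {
        print "Caught exception: $@";
    } else {
        print "no exception, length $length\n";
    }
}
```

After a failed eval, the message given to die is available in $@; after a successful one, $@ is empty and the value of the block is returned.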

• In bioperl, exceptions are also used to implement interface classes. The following example, from Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
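The abstract method idiom can be reproduced in a few lines of plain Perl. The classes My::SeqI, My::Seq and My::BrokenSeq below are hypothetical and only illustrate the idea; they are not bioperl classes.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() must be provided by subclasses.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq { die "My::SeqI: implementing class did not provide seq\n"; }

# A subclass that implements the method.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{-seq}; }

# A subclass that forgets to implement it.
package My::BrokenSeq;
our @ISA = ('My::SeqI');

package main;

my $good = My::Seq->new(-seq => 'ACGT');
print $good->seq, "\n";

my $broken = My::BrokenSeq->new(-seq => 'ACGT');
eval { $broken->seq };
print "error: $@" if $@;
```

Calling seq on the incomplete subclass falls back to the interface version, which raises the exception; the complete subclass never reaches it.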

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');      # -f takes a value, stored in $opt_f
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
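Because getopts works on @ARGV, its behaviour is easy to try with a localized copy of the argument list. The wrapper parse_args below is hypothetical; the option letters follow the first example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse a list of arguments instead of the real
# command line, so that the behaviour is easy to try out.
sub parse_args {
    local @ARGV = @_;
    my %opts = ();
    getopts('hi:o:', \%opts) or die "bad options\n";
    my %parsed = ( help    => exists $opts{h} ? 1 : 0,
                   inform  => $opts{i} || 'swiss',
                   outform => $opts{o} || 'fasta',
                   rest    => [ @ARGV ] );     # what is left after the options
    return \%parsed;
}

my $args = parse_args('-i', 'embl', 'seqs.dat');
print "in: $args->{inform}, out: $args->{outform}, files: @{ $args->{rest} }\n";
```

A letter followed by a colon in the option string takes a value; a letter alone is a simple flag. getopts removes the options it recognizes from @ARGV and leaves the remaining arguments in place.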

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
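These class operations can be tried with two tiny classes in plain Perl. My::Feature and My::Exon are hypothetical and only illustrate @ISA, isa and ref; they are not bioperl classes.

```perl
use strict;
use warnings;

# Hypothetical classes, for illustration only.
package My::Feature;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package My::Exon;
our @ISA = ('My::Feature');      # My::Exon inherits new() from My::Feature

package main;

my $exon = My::Exon->new(-start => 10, -end => 50);

print "class of exon: ", ref($exon), "\n";
print "exon is a My::Feature\n"     if $exon->isa('My::Feature');
print "My::Exon is a My::Feature\n" if My::Exon->isa('My::Feature');
```

ref() reports the class the object was blessed into, while isa() also answers true for every class found by following @ISA.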

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
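The moment at which a BEGIN block runs can be observed with a short sketch in plain Perl. The @order array is only there to record the sequence of events.

```perl
use strict;
use warnings;

our @order;                       # declaration only, no run-time assignment

push @order, 'first statement';   # executed at run time

BEGIN {
    push @order, 'BEGIN block';   # executed as soon as it has been compiled
}

print join(', ', @order), "\n";
```

Although the BEGIN block appears after the first push in the file, it is executed during compilation, before any ordinary statement runs.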

6.3 Perl reminders for a more advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.

6.3.2 Compiler instructions

• use strict: no symbolic references, no undeclared variables (local variables are declared with a my statement; global variables must be pre-declared or fully qualified), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Tie associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
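A tied filehandle can be built in a few lines of plain Perl. In this sketch the class My::RecordFh is hypothetical: it reads from and writes to in-memory lists instead of sequence files, but it uses the same TIEHANDLE, READLINE and PRINT hooks.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical class: a tied filehandle over an in-memory list of records.
package My::RecordFh;

sub TIEHANDLE {
    my ($class, $records) = @_;
    return bless { records => $records, written => [] }, $class;
}

# Called for <$fh>: return the next record (or all of them in list context).
sub READLINE {
    my $self = shift;
    return shift @{ $self->{records} } unless wantarray;
    return splice @{ $self->{records} };
}

# Called for print $fh ...: store what was printed.
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym;
my $obj = tie *$fh, 'My::RecordFh', [ 'seq1', 'seq2' ];

my $first = <$fh>;            # calls READLINE
print $fh $first;             # calls PRINT
print "read $first, stored @{ $obj->{written} }\n";
```

The program using $fh only sees the ordinary read and print syntax; what actually happens is decided by the class the handle is tied to.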


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name: replace its
        # extension, if any, by the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#!/local/bin/perl -w
# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN, -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";
    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
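
The option string given to getopts decides which switches are plain flags and which ones expect a value. The following is a minimal sketch that uses only the core Getopt::Std module; the helper name parse_convert_opts and the sample arguments are illustrative assumptions, not part of the course scripts.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse the converter options from a list instead of
# the real command line, so the behaviour is easy to try out.
sub parse_convert_opts {
    local @ARGV = @_;               # getopts() always reads @ARGV
    my %opts;
    # 'h' is a plain flag; 'i:' and 'o:' each expect a value
    getopts('hi:o:', \%opts) or return;
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} // 'swiss'),
        outform => lc($opts{o} // 'fasta'),
        rest    => [@ARGV],         # whatever getopts() left untouched
    };
}

my $o = parse_convert_opts('-o', 'GCG', 'seqs.sp');
print "in=$o->{inform} out=$o->{outform} rest=@{$o->{rest}}\n";
```

Applying the default before calling lc avoids the "uninitialized value" warning that lc($opts{'i'}) raises under -w when the switch is absent.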

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function -- exercise
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

# almost the same as for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#!/local/bin/perl -w
# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment
my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps
print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
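
The column arithmetic of this solution can be tried without bioperl. The sketch below works on plain strings; the helper name ungapped_bounds and the three sample rows are assumptions made for illustration, and the returned pair corresponds to the two arguments passed to slice.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Return the 1-based first and last column to keep so that no sequence
# starts or ends with a gap inside the kept block.
sub ungapped_bounds {
    my ($gap_char, @seqs) = @_;
    my ($lead, $trail) = (0, 0);
    for my $str (@seqs) {
        my ($head) = $str =~ /^(\Q$gap_char\E*)/;   # leading gap run
        my ($tail) = $str =~ /(\Q$gap_char\E*)$/;   # trailing gap run
        $lead  = max($lead,  length $head);
        $trail = max($trail, length $tail);
    }
    return ($lead + 1, length($seqs[0]) - $trail);
}

# Made-up alignment rows, all of the same length
my @rows = ('--ACGT-A--', 'TTACGTTA-C', '-TACG-TACC');
my ($from, $to) = ungapped_bounds('-', @rows);
print "keep columns $from..$to\n";
print substr($_, $from - 1, $to - $from + 1), "\n" for @rows;
```

Internal gaps are left alone; only the longest leading and trailing gap runs determine the block that is kept.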

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else
# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog' => 'blastp',
    '-data' => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# databank prefix of a hit name, i.e. the word before the first '|'
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
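
The "first hit per databank" logic does not depend on bioperl and can be tested on plain hit names. The following is a sketch under the assumption that hit names look like databank|accession|entry; the sample names and accessions are made up.

```perl
use strict;
use warnings;

# Return the databank prefix of a hit name such as "sp|P12345|NAME_ECOLI",
# or undef when the name contains no '|' separator.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hit names are assumed to arrive best-first, as in a Blast report.
my @hits = ('sp|P0A8R0|A_ECOLI', 'tr|Q9X000|B_ECOLI', 'sp|P11111|C_ECOLI', 'plainname');
my (%done, @best);
for my $hit (@hits) {
    my $db = dbname($hit);
    next unless defined $db;      # skip names without a databank prefix
    next if $done{$db}++;         # keep only the first hit per databank
    push @best, $hit;
}
print "$_\n" for @best;
```

Returning undef for names without a separator avoids reusing a stale $1 from an earlier successful match, which is a risk whenever $1 is read without checking that the match succeeded.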

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20 Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq => $seq->seq,
    -id => $seq->display_id,
    -start => 1,
    -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq => $seq,
            -id => $name,
            -start => $start,
            -end => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
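
The padding done by fill_gaps can also be written with the repetition operator. This is an equivalent sketch of the same arithmetic, not the course's code; the sample call is made up.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at 1-based column $start and the
# result is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # sequence runs past the last column
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 8), "\n";
```

Note that an HSP string can itself contain gap characters, so its length may differ from the number of residues it covers on the contig.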

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                         program => 'blastp');
    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
            last;
        }
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag: ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                     'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25 Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray;
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
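
Comparing two rounds is a set difference in both directions. The sketch below shows it on plain lists of hit names with hash lookups; compare_rounds and the hit names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Given the hit names of the previous and the current round, return
# (\@new, \@disappeared), each in the order of its source list.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_prev = map { $_ => 1 } @$previous;
    my %in_curr = map { $_ => 1 } @$current;
    my @new  = grep { !$in_prev{$_} } @$current;
    my @gone = grep { !$in_curr{$_} } @$previous;
    return (\@new, \@gone);
}

my ($new, $gone) = compare_rounds([qw(hitA hitB hitC)], [qw(hitB hitC hitD)]);
print "NEW: @$new\n";
print "DISAPPEARED: @$gone\n";
```

A hash lookup compares whole names, whereas a grep with /\Q$hit\E/ also matches when one name is a substring of another.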


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq => $hsp->qs,
        -id => $query1->display_id,
        -start => 1,
        -end => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq => $hsp->ss,
        -id => $report->sbjctName,
        -start => 1,
        -end => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#!/local/bin/perl -w
# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {
    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1; # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }
    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#!/usr/bin/perl -w
# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#!/usr/bin/perl -w
# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#!/local/bin/perl -w
# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {
    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {
        $loc = $exon->location();

        if ($first) {
            $first=0;
            $start=$loc->start-$delta;
            if ($start < 0) { $start = 0; }
        }
        $loc->start($loc->start-$start);
        $loc->end ($loc->end-$start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end+$delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if ( ! $nocds) {
        # construct a SeqFeature CDS for the entry ( join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                      -location => $newloc);
        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());
        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');
    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
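
The coordinate bookkeeping of this solution (a window of delta bases on each side of the gene, and exon positions re-expressed relative to that window) is easier to check in isolation. The sketch below is a variant under stated assumptions: coordinates are 1-based and inclusive, the window is clipped to 1..sequence length, and exons are given as plain [start, end] pairs. The helper gene_window and the numbers are made up.

```perl
use strict;
use warnings;

# Compute a window of +/- $delta bases around a gene and express the exon
# coordinates relative to that window. Coordinates are 1-based and
# inclusive; the window is clipped to 1..$seq_length.
sub gene_window {
    my ($exons, $delta, $seq_length) = @_;     # $exons: [[start, end], ...]
    my @sorted = sort { $a->[0] <=> $b->[0] } @$exons;
    my $start = $sorted[0][0] - $delta;
    $start = 1 if $start < 1;
    my $end = $sorted[-1][1] + $delta;
    $end = $seq_length if $end > $seq_length;
    my $offset = $start - 1;                   # shift applied to every exon
    my @relative = map { [ $_->[0] - $offset, $_->[1] - $offset ] } @sorted;
    return ($start, $end, \@relative);
}

my ($start, $end, $rel) = gene_window([[250, 300], [420, 480]], 100, 520);
print "window $start..$end\n";
print "exon $_->[0]..$_->[1]\n" for @$rel;
```

Subtracting start - 1 rather than start keeps the first base of the window at position 1, which matches the coordinates of the extracted subsequence.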


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot => 'swiss',
                       sptrnrdb => 'swiss',
                       gb => 'genbank',
                       embl => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1; # a module must return a true value
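
The entry handling in get_Seq (split the db:id string, check the databank, look up the parser format) can be exercised without golden or bioperl. This is a stand-alone sketch; resolve_entry is a hypothetical helper, and the table simply repeats the databank-to-format mapping of the module.

```perl
use strict;
use warnings;

# Same databank => format table as in the module
my %format_of = (sprot => 'swiss', sptrnrdb => 'swiss', gb => 'genbank', embl => 'embl');

# Split "db:id" and return (db, id, format); die with a message otherwise.
sub resolve_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "$entry -- expected db:id\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $format_of{$db};
    return ($db, $id, $format_of{$db});
}

my ($db, $id, $format) = resolve_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";
print "error: $@" unless eval { resolve_entry('nodb:X'); 1 };
```

Wrapping the call in eval plays the role that catching the exception raised by throw plays for the module's callers.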


Chapter 2 Sequences

Figure 2.3 Bio::Annotation package structure

Example 2.3 Find the references to the PDB database entries

Does the protein have a known structure, and if so, what are its crystallographic structures?

Where can you find this information within a SwissProt entry?

Where can you find this information in the sequence object in bioperl?

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4 More on annotations

Find the function of the protein within comments (if available) (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5 Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary [codesuseSeq10pl] solution of the exercises in this section.

2.2.4 Bioperl 'Sequence' classes structure

Figure 2.4 Bioperl 'Sequence' classes structure

2.3 Features and Location classes

2.3.1 Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of the class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).
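
What "is a feature" means in Perl terms can be shown with the isa method. The packages below are hypothetical stand-ins (Toy::...) that copy only the inheritance chain described in the paragraph above; they carry no bioperl behaviour.

```perl
use strict;
use warnings;

# Stand-in packages (hypothetical names) reproducing only the chain
# SimilarityPair -> FeaturePair -> feature interface.
package Toy::SeqFeatureI;    sub new { my $class = shift; return bless {@_}, $class }
package Toy::FeaturePair;    our @ISA = ('Toy::SeqFeatureI');
package Toy::SimilarityPair; our @ISA = ('Toy::FeaturePair');

package main;

my $hsp = Toy::SimilarityPair->new(start => 10, end => 42);
for my $class (qw(Toy::SimilarityPair Toy::FeaturePair Toy::SeqFeatureI)) {
    printf "%-20s %s\n", $class, $hsp->isa($class) ? 'yes' : 'no';
}
```

This is why code written for features in general, such as a loop over all_SeqFeatures, can also handle an HSP that was added to a sequence.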

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

• Bio::SeqFeature::Gene::ExonI

• Bio::SeqFeature::Gene::Exon

• Bio::SeqFeature::Gene::Transcript

• Bio::SeqFeature::Gene::TranscriptI

• Bio::SeqFeature::Gene::GeneStructureI

• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5 Features Classes structure

2.3.2 Code reading: extracting CDS

Exercise 2.6 extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or GenBank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7 extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
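
The data structure behind the tag system is simple: one primary tag, plus secondary tags that each hold a list of values. The class below is a hypothetical stand-in (Toy::Feature) written for illustration; its method names are chosen to mirror the tag methods of Bio::SeqFeature::Generic, but it is not that class and shares no code with it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a feature: one primary tag plus any number of
# secondary tags, each holding a list of values.
package Toy::Feature;
sub new            { my ($class, %a) = @_; return bless { primary => $a{primary}, tags => {} }, $class }
sub primary_tag    { $_[0]{primary} }
sub add_tag_value  { my ($self, $tag, $value) = @_; push @{ $self->{tags}{$tag} }, $value; return }
sub has_tag        { exists $_[0]{tags}{ $_[1] } }
sub all_tags       { sort keys %{ $_[0]{tags} } }
sub each_tag_value { @{ $_[0]{tags}{ $_[1] } || [] } }

package main;

my $cds = Toy::Feature->new(primary => 'CDS');
$cds->add_tag_value(codon => '(seq:"cug",aa:Ser)');
$cds->add_tag_value(codon => '(seq:"tga",aa:Trp)');
$cds->add_tag_value(source => 'genscan');     # a qualifier type of our own

for my $tag ($cds->all_tags) {
    print $cds->primary_tag, " /$tag=$_\n" for $cds->each_tag_value($tag);
}
```

Because a tag maps to a list, the same qualifier (here codon) can appear several times on one feature, and adding a new qualifier type only requires a new key.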

Figure 2.6 Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start => '<30',
                                             -end => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases); examples: 123^124, a site between bases 123 and 124; 145^177, a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range); example: (102.110) is a site within the 102..110 range, inclusive

• '<' BEFORE

• '>' AFTER
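
To make the coding concrete, here is a simplified reading of single position strings in plain Perl. It is only a sketch for this list: classify_position is a hypothetical helper, it treats a bare number as EXACT, and it does not claim to reproduce how Bio::Location::Fuzzy parses its arguments.

```perl
use strict;
use warnings;

# Classify one position string using the codes listed above.
# Returns (type, min, max); min or max is undef when unbounded.
sub classify_position {
    my ($pos) = @_;
    return ('EXACT',   $1, $1)    if $pos =~ /^(\d+)$/;
    return ('BEFORE',  undef, $1) if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1, undef) if $pos =~ /^>(\d+)$/;
    return ('BETWEEN', $1, $2)    if $pos =~ /^(\d+)\^(\d+)$/;
    return ('WITHIN',  $1, $2)    if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    die "unrecognised position: $pos\n";
}

for my $pos ('30', '<30', '>90', '123^124', '(102.110)') {
    my ($type, $min, $max) = classify_position($pos);
    printf "%-10s %-8s min=%s max=%s\n", $pos, $type, $min // '-', $max // '-';
}
```

The min/max pair is what a coordinate policy has to work with when it must pick one concrete start or end for a fuzzy position.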

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7 Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8 Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;              # sequence as string
$seqobj->subseq(10,40);    # sub-sequence as string
$seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AA...TAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ =>$pat, -TYPE =>'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>'EcoRV--GAT^ATC',   # creates a new one
                                          -MAKE =>'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file'=>'sigtest.aa',   # not a Bio::Seq object
                                               '-threshold'=>'3.5',
                                               '-desc'=>'test sigcleave protein seq',
                                               '-type'=>'AMINO ');
%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;
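
To see what some of these calls compute, the following sketch redoes three of them on a plain string with core Perl only. The subroutines are simplified stand-ins named after the methods they illustrate; they are not the bioperl implementations, and the sample sequence is made up.

```perl
use strict;
use warnings;

# 1-based, inclusive sub-sequence, the coordinate convention of subseq()
sub subseq {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Count each residue, in the spirit of count_monomers()
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Reverse complement of a plain DNA string (no ambiguity codes)
sub revcom {
    my ($seq) = @_;
    (my $rc = reverse $seq) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna = 'ATGGCGTTAGC';
print subseq($dna, 2, 5), "\n";
my $n = count_monomers($dna);
print join(' ', map { "$_=$n->{$_}" } sort keys %$n), "\n";
print revcom($dna), "\n";
```

The main thing to remember is the coordinate convention: bioperl positions start at 1 and include both ends, whereas Perl's substr starts at 0 and takes a length.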


Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

print $out $_ while <$in>;


AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();


SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. To filter columns by their properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters out gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# if you need to work on the columns, create a list
# containing all columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
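The transpose, filter, transpose-back idea of Example 3.3 can be tried without bioperl. The following is a minimal sketch on plain strings; the function name remove_gap_columns and the sample rows are illustrative assumptions, not part of the course material or of bioperl.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a list of
# equal-length aligned strings. Plain strings stand in for sequence objects.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # Transpose rows into columns (one string per column).
    my @cols;
    for my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ for split //, $row;
    }

    # Keep only the columns without any gap.
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # Transpose back into rows.
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $rownr = 0;
        $out[$rownr++] .= $_ for split //, $col;
    }
    return @out;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @filtered;
```

With these three sample rows, columns 2 and 3 each contain a gap, so every row is reduced to AGT.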

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is created through the Bio::Tools::Run::StandAloneBlast factory actually belongs to the Bio::SearchIO class, so you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

  $done{$database} = 1;

Solution A.16
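The hash hint above can be tried on plain strings before applying it to real hits. This is only a sketch: the hit names are made up, and the dbname function shown here is an assumed stand-in (splitting on "|") for the course's own dbname helper, whose code is not shown in this document.

```perl
use strict;
use warnings;

# Made-up hit names in a "database|id|name" layout, best hit first.
my @hit_names = ('sp|ID1|first', 'gb|ID2|second', 'sp|ID3|third',
                 'pir|ID4|fourth', 'gb|ID5|fifth');

# Assumed stand-in for the dbname() helper: the database is the first field.
sub dbname {
    my ($name) = @_;
    return (split /\|/, $name)[0];
}

my %done;   # databases already seen
my @best;   # first (best) hit of each database
for my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};
    $done{$database} = 1;
    push @best, $name;
}

print "$_\n" for @best;
```

Because Blast reports list hits from best to worst, keeping the first hit seen for each database is enough; the hash only records which databases have already produced one.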

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while ( my $subject = $blast_report->nextSbjct() ) {
    print $subject->name(), "\n";
    while ( my $hsp = $subject->nextHSP() ) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');

my $query = <$query_in>;
$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while ( my $hit = $result->nextSbjct() ) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ( $hit->nextHSP ) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'psiblast',
                            -file   => $ARGV[0]);

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while ( my $hsp = $report->next_feature ) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while ( my $gene = $genscan->next_prediction() ) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while ( my $gene = $genscan->next_prediction() ) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ( $gene->exons() ) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ( $gene->introns() ) {
        my $loc = $intron->location;
        print "intron: primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ( $gene->utrs() ) {
        my $loc = $utr->location;
        print "utr: primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                     -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for gaining a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                   -parse     => 1,
                                   -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

• references on an anonymous array:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a plain hash
%a = (a => 1, b => 2);
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
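The three kinds of reference parameters listed above (hash, function, output array) can be exercised together without bioperl. In this sketch, run_filter, long_enough and the parameter values are hypothetical names chosen for illustration; only the -run, -filt_func and -save_array keys echo the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: filters a list with a caller-supplied function and
# stores the survivors in a caller-supplied array (an output parameter).
sub run_filter {
    my (%args) = @_;
    my $param  = $args{-run};          # reference to a hash
    my $filter = $args{-filt_func};    # reference to a function
    my $saved  = $args{-save_array};   # reference to an array
    for my $item (@{ $param->{-seqs} }) {
        push @$saved, $item if $filter->($item);
    }
    return scalar @$saved;
}

# Hypothetical filter: keep strings of at least 4 characters.
sub long_enough { return length($_[0]) >= 4 }

my %runParam = (-seqs => [ 'ACGT', 'AC', 'ACGTAC' ]);
my @kept;
my $count = run_filter(-run        => \%runParam,
                       -filt_func  => \&long_enough,
                       -save_array => \@kept);

print "$count kept: @kept\n";
print ref(\%runParam), " ", ref(\&long_enough), " ", ref(\@kept), "\n";
```

The last line prints HASH CODE ARRAY, which is what ref() returns for references to a hash, a function and an array.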

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a sequence

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh(-file => 'f');     # make a tied filehandle
$sequence = <$in>;                         # read a sequence object
$out = Bio::SeqIO->newFh(-file => '>f');
print $out $sequence;                      # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
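As a small illustration of passing a bareword filehandle through a typeglob reference, the sketch below writes to an in-memory buffer rather than to a real file, so that it runs anywhere. The write_record helper and the record contents are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one fasta-like record to the filehandle it is given.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
}

# Open a bareword filehandle on an in-memory buffer instead of a real file.
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";

# A bareword filehandle cannot be passed directly as a parameter;
# pass a reference to its typeglob instead.
write_record(\*OUTFILE, 'seq1', 'ACGT');
close(OUTFILE);

print $buffer;
```

The same \*OUTFILE expression is what would be given to a -fh parameter.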

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
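The combination of throw, eval and abstract methods can be reproduced in a few lines of plain perl. This is a sketch under stated assumptions: the My::SeqI and My::Seq classes are hypothetical, and the simplified throw below omits the stack trace that the bioperl method adds.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and dies unless overridden.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub throw {
    my ($self, $string) = @_;
    die "-------------------- EXCEPTION --------------------\n"
      . "MSG: $string\n"
      . "-------------------------------------------\n";
}
sub seq {
    my ($self) = @_;
    $self->throw(ref($self) . " did not provide a seq method");
}

# Hypothetical concrete class that does implement seq().
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{-seq} }

package main;

for my $obj (My::Seq->new(-seq => 'ACGT'), My::SeqI->new()) {
    my $seq = eval { $obj->seq() };   # note the ; that ends the eval statement
    if ($@) {
        print "Caught exception from ", ref($obj), "\n";
    } else {
        print "no exception: $seq\n";
    }
}
```

Calling seq on the interface class itself raises the exception, while the subclass that overrides seq returns normally.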

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
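To see what getopts leaves in the hash and in @ARGV, the first example can be run on a simulated command line. The argument values below (embl, seqs.fasta, extra.txt) are arbitrary illustrations.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line so that the sketch runs without arguments.
local @ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra.txt');

my %opts = ();
# A colon after a letter means that the option takes a value.
getopts('i:o:I:O:', \%opts) or die "bad options\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

# Options are removed from @ARGV; the remaining arguments stay in place.
print "$inform -> $outform, $infile -> $outfile, rest: @ARGV\n";
```

Options that are not given (-o and -I here) fall back to the defaults on the right-hand side of ||.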

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
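The @ISA, isa and ref mechanisms above can be checked on two toy classes. My::Feature and My::Exon are hypothetical names used only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical base class with a conventional new() constructor.
package My::Feature;
sub new { my ($class, %args) = @_; return bless {%args}, $class }

# Hypothetical subclass: inheritance is declared through @ISA.
package My::Exon;
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new(-start => 10, -end => 42);

print "class of exon: ", ref($exon), "\n";
print "the object is-a My::Feature\n" if $exon->isa('My::Feature');
print "the class is-a My::Feature\n"  if My::Exon->isa('My::Feature');
```

new is inherited from My::Feature but, because it blesses into $class, the resulting object belongs to My::Exon.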

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a more advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared non-local variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;          # make a tied filehandle
$sequence = <$fh>;       # read a sequence object (calls READLINE)
print $fh $sequence;     # write a sequence object (calls PRINT)
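The tied-filehandle mechanism can be observed in isolation with a toy class. In this sketch, My::ListFh is a hypothetical class that serves items from a list and records what is printed; it only mimics the READLINE and PRINT roles described above and is not the bioperl implementation.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical class: a tied filehandle that reads from and writes to
# in-memory lists instead of a sequence stream.
package My::ListFh;
sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub READLINE {
    my $self = shift;
    # Scalar context: return the next item only.
    return shift @{ $self->{in} } unless wantarray;
    # List context: return all the remaining items.
    my @all = @{ $self->{in} };
    $self->{in} = [];
    return @all;
}
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh   = Symbol::gensym();
my $tied = tie *$fh, 'My::ListFh', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT

print "first: $first, rest: @rest, written: @{ $tied->{out} }\n";
```

The wantarray test is what lets the same READLINE serve both `$x = <$fh>` and `@all = <$fh>`, as in the Bio::SeqIO version.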


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # derive the output file name from the input file name
    $outfile = $infile;
    $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form} ) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;   # false
    }
    return 1;       # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
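The trimming logic of this solution can be tried without a bioperl installation. The sketch below is plain Perl working on raw strings rather than on a Bio::SimpleAlign object; the sub name trim_gap_ends and the sample rows are invented for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): given aligned rows of equal
# length, drop the leading and trailing columns covered by the longest
# run of border gaps found in any row.
sub trim_gap_ends {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    $width = 0 if $width < 0;
    return map { substr($_, $startgaps, $width) } @rows;
}

# invented sample alignment, 11 columns wide
my @trimmed = trim_gap_ends('-', '--MKVLA-T--', 'MAMKV-ATTG-', '-AMKVLATT--');
print "$_\n" for @trimmed;
```

Internal gaps are kept on purpose: only the border columns are removed, which is what the slice call in the solution does on the alignment object.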

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

$factory->e(1.0);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog'     => 'blastp',
    '-data'     => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit accession: ", $hit->accession(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp() ) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

my %done;

# databank name = first word before a '|' in the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit() ) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
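The "first hit per databank" bookkeeping can be checked in isolation from any Blast report. The following is a minimal plain-Perl sketch; the sub best_hit_by_db and the hit names are made up, and it assumes hit names of the form db|accession|name listed in report order (best hit first).

```perl
use strict;
use warnings;

# Return the databank prefix of a hit name such as "sp|X00001|AAAA_TEST",
# i.e. the word before the first '|'. Returns '' if there is none.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Hypothetical helper: keep only the first hit seen for each databank,
# preserving the order of the input list.
sub best_hit_by_db {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# invented hit names
my @hits = ('sp|X00001|AAAA_TEST', 'sp|X00002|BBBB_TEST',
            'pdb|0XXX|A', 'pdb|0YYY|B');
print "$_\n" for best_hit_by_db(@hits);
```

The %done hash plays the same role as in the solution: a databank is marked the first time it is met, and every later hit from it is skipped.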

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall(\@query_seqs);
while ( my $result = $blast_report->next_result() ) {
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while ( my $hit = $result->next_hit() ) {
    while ( my $hsp = $hit->next_hsp() ) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while ( my $hit = $result->next_hit() ) {
    while ( my $hsp = $hit->next_hsp() ) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length
);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit() ) {
    my $i = 0;
    while ( my $hsp = $hit->next_hsp() ) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end
        );
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                         program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag: ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                     'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray;
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
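The comparison between two consecutive PSI-BLAST rounds is a set difference on hit names. The sketch below shows that idea with hashes in plain Perl; it is not the code of the solution, the sub compare_rounds is hypothetical, and the hit names are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the hit names of two consecutive rounds.
# Returns two array references: hits that are new in the current round,
# and hits of the previous round that are no longer present.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@disappeared);
}

# placeholder hit names for rounds n-1 and n
my ($new, $gone) = compare_rounds([qw(hitA hitB hitC)], [qw(hitB hitC hitD)]);
print "NEW hit name: $_\n"         for @$new;
print "DISAPPEARED hit name: $_\n" for @$gone;
```

A hash lookup tests exact names, whereas the grep with a regular expression in the solution also matches substrings; which behaviour is wanted depends on how hit names are written in the report.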


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;
    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift exon coordinates so they are relative to the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if ( !$nocds ) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databank name => bioperl sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # entries are given as db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (!$id) {
        $self->throw ("no database specified");
    }

    if (!_has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) {
        $self->throw ("$id -- no such entry in database $db");
    }

    $self->throw ("$id -- no such entry in database $db") if ( !defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;


Page 17: Bio Perl

Chapter 2 Sequences

in bioperl

# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

print "\nPDB Structures: ", join (" ", @structures), "\n";

Exercise 2.4. More on annotations

Find the function of the protein within the comments, if available (Solution A.4).

Go to

Work on Exercise 2.6 before you continue.

Exercise 2.5. Transmembrane helices

Find transmembrane helices in the entry (Solution A.5).

Tip

Use Figure 2.2 again.

Use the appropriate object's documentation.

Here is a summary solution [codes/useSeq10.pl] of the exercises in this section.

2.2.4. Bioperl 'Sequence' classes structure

Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]
• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI
• Bio::SeqFeature::Generic
• Bio::SeqFeature::FeaturePair
• Bio::SeqFeature::SimilarityPair
• Bio::SeqFeature::Similarity
• Bio::SeqFeature::Gene
• Bio::SeqFeature::Gene::ExonI
• Bio::SeqFeature::Gene::Exon
• Bio::SeqFeature::Gene::Transcript
• Bio::SeqFeature::Gene::TranscriptI
• Bio::SeqFeature::Gene::GeneStructureI
• Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature classes.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature classes.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)
• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags
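To make the primary tag / secondary tag distinction concrete, here is a small plain-Perl sketch that turns a feature written in feature-table syntax into one primary tag plus a hash of secondary tags. It only models the idea; parse_feature is a hypothetical helper and is not how bioperl itself stores or parses features.

```perl
use strict;
use warnings;

# Hypothetical plain-Perl model of a feature: one primary tag (the feature
# key), a location string, and a hash mapping each secondary tag to the
# list of its values (a tag may occur several times).
sub parse_feature {
    my ($key_line, @qualifier_lines) = @_;
    my ($primary_tag, $location) = split ' ', $key_line;
    my %tags;
    foreach my $line (@qualifier_lines) {
        # a qualifier is written /tag=value
        next unless $line =~ m{^\s*/(\w+)=(.*)$};
        push @{ $tags{$1} }, $2;
    }
    return { primary_tag => $primary_tag, location => $location, tags => \%tags };
}

my $feat = parse_feature(
    'CDS 1..1000',
    '/codon=(seq:"cug",aa:Ser)',
    '/codon=(seq:"tga",aa:Trp)',
);
print "primary tag: $feat->{primary_tag}\n";
foreach my $tag (sort keys %{ $feat->{tags} }) {
    print "tag $tag: $_\n" for @{ $feat->{tags}{$tag} };
}
```

Storing a list of values per tag matters because the same qualifier, here codon, can appear more than once on a feature.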

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..': EXACT
• '^': BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.
• '.': WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.
• '<': BEFORE
• '>': AFTER
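The coding above can be illustrated with a short plain-Perl classifier for a single position string. This is only a sketch of the notation: classify_position is a hypothetical helper, the (type, min, max) return convention is an assumption made for this example, and Bio::Location::Fuzzy may represent the same information differently.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one position written in feature-table
# syntax and return (type, min, max); undef marks an open bound.
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  undef, $1) if $pos =~ /^<(\d+)$/;
    return ('AFTER',   $1, undef) if $pos =~ /^>(\d+)$/;
    return ('BETWEEN', $1, $2)    if $pos =~ /^(\d+)\^(\d+)$/;
    return ('WITHIN',  $1, $2)    if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;
    return ('EXACT',   $1, $1)    if $pos =~ /^(\d+)$/;
    die "unrecognised position '$pos'\n";
}

foreach my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    my ($type, $min, $max) = classify_position($pos);
    printf "%-10s %-8s min=%s max=%s\n", $pos, $type,
        defined $min ? $min : '?', defined $max ? $max : '?';
}
```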

Location Classes:

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
    • interface: Bio::Location::CoordinatePolicyI
    • default: Bio::Location::WidestCoordPolicy
    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;               # sequence as string
$seqobj->subseq(10,40);     # sub-sequence as string
$seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AA...TAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                         -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                               '-threshold' => '3.5',
                                               '-desc'      => 'test sigcleave protein seq',
                                               '-type'      => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;
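As a rough idea of what monomer and word counting amounts to, the following plain-Perl sketch tallies words of a given length in a raw sequence string. It does not call Bio::Tools::SeqStats or Bio::Tools::SeqWords and makes no claim of reproducing their exact output; tally_words is a hypothetical helper, and counting non-overlapping words from the first residue is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: tally non-overlapping words of a given length,
# starting at the first residue. Trailing residues that do not fill a
# whole word are ignored. A word length of 1 gives a monomer count.
sub tally_words {
    my ($sequence, $word_length) = @_;
    my %count;
    $count{$_}++ for uc($sequence) =~ /(.{$word_length})/g;
    return \%count;
}

# invented sample sequence
my $monomers = tally_words('ATGGCGTTAG', 1);
print "$_: $monomers->{$_}\n" for sort keys %$monomers;
```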


Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;


AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
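The column logic of this example can be tried without bioperl. The following is only a sketch on plain strings: the remove_gap_columns function and the three sample rows are made up for illustration and are not part of the course code or of bioperl.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a list of
# equal-length aligned strings (plain strings, not bioperl objects).
sub remove_gap_columns {
    my ($gap_char, @rows) = @_;

    # 1. transpose the rows into column strings
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        foreach my $chr (split //, $row) {
            $cols[$colnr++] .= $chr;
        }
    }

    # 2. keep only the columns without any gap
    my @kept = grep { index($_, $gap_char) < 0 } @cols;

    # 3. transpose the kept columns back into rows
    my @new_rows = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        foreach my $chr (split //, $col) {
            $new_rows[$rownr++] .= $chr;
        }
    }
    return @new_rows;
}

# made-up sample alignment rows
my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCG-');
print join("\n", @filtered), "\n";
```

Only the first and fourth columns are free of gaps here, so each row is reduced to two characters.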

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9


Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics


Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation of the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast (with _READMETHOD set to "Blast") actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast classes diagram

Figure 4.1. Blast classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in an NCBI Blast report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
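The "already done" hash suggested in the hint can be sketched on plain strings, without a Blast report. Everything below is an assumption made for illustration: the dbname_of helper simply takes the text before the first "|" as the database name (it is not the course's dbname function), and the hit names are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: take the database name to be the part of the hit
# name before the first '|' (an assumption made for this sketch only).
sub dbname_of {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Given hit names sorted best-first, keep the first hit seen per database.
sub best_hit_by_db {
    my @hit_names = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $database = dbname_of($name);
        next if $done{$database};     # this database already has its best hit
        $done{$database} = 1;         # mark the database as done
        push @best, $name;
    }
    return @best;
}

# invented hit names, assumed to be sorted from best to worst
my @best = best_hit_by_db('sp|X1', 'pdb|Y1', 'sp|X2', 'pdb|Y2');
print "$_\n" foreach @best;
```

The sketch relies on the hits being visited from best to worst, which is the order in which a Blast report lists them.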

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-BLAST

The following example runs a PSI-BLAST:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);                 # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            # loop with while, so that every HSP of the hit is visited
            while( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-BLAST SearchIO class

There is also a SearchIO class for PSI-BLAST reports, but it does not give access to the individual iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-BLAST report

Load a PSI-BLAST report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that they are not efficient for large databases (indexing time and index size). All these databases are already indexed, very efficiently, by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you to use the modules [use] or to gain a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• reference to an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

• references to anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference to an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference to an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
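The three kinds of parameters listed above (a hash, a function, an output array) can be combined in one small core-Perl sketch. The select_keys function and the score values are invented for illustration; they do not come from bioperl.

```perl
use strict;
use warnings;

# Apply a filter function (passed as a code reference) to the values of a
# hash (passed as a hash reference) and collect the selected keys in an
# array supplied by the caller (an output parameter, passed by reference).
sub select_keys {
    my ($scores, $filter, $selected) = @_;
    foreach my $key (sort keys %$scores) {
        push @$selected, $key if $filter->($scores->{$key});
    }
    return scalar @$selected;
}

my %scores = (hit1 => 250, hit2 => 40, hit3 => 120);   # made-up values
my @kept;
my $count = select_keys(\%scores, sub { $_[0] >= 100 }, \@kept);

print "$count selected: @kept\n";
print ref(\%scores), " ", ref(\@kept), " ", ref(sub {}), "\n";
```

The last line shows what ref() returns for each kind of reference: HASH, ARRAY and CODE.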

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;                # read one line
    print OUTFILE $sequence;         # write a scalar

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> g');
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
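As a minimal core-Perl illustration of passing a filehandle as a parameter, the sketch below hands a typeglob reference and then a lexical filehandle to the same function. The write_line function is made up for this example, and the in-memory file assumes perl 5.8 or later.

```perl
use strict;
use warnings;

# Write a message to whatever filehandle the caller passes in.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# a bareword filehandle has to be passed as a reference to its typeglob
write_line(\*STDOUT, "to standard output");

# The same sub works with a lexical filehandle; here it writes into an
# in-memory string so that the result can be inspected.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($mem, "to a string");
close($mem);

print "buffer holds: $buffer";
```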

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
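The two ideas above, an "abstract" method that raises an exception and a caller that catches it with eval, can be tried together in core Perl. The ShapeI, Square and Blob classes below are hypothetical and only mimic the pattern; they use a plain die rather than bioperl's throw.

```perl
use strict;
use warnings;

package ShapeI;                  # hypothetical interface class

sub new {
    my ($class, %args) = @_;
    return bless \%args, $class;
}

# "Abstract" method: raises an exception unless a subclass overrides it.
sub area {
    my $self = shift;
    die ref($self), " did not provide an area method\n";
}

package Square;                  # implements the interface
our @ISA = ('ShapeI');
sub area { my $self = shift; return $self->{side} ** 2 }

package Blob;                    # does not override area
our @ISA = ('ShapeI');

package main;

# Call area inside eval and describe the outcome as a string.
sub describe {
    my ($shape) = @_;
    my $area = eval { $shape->area };    # note the ; closing the eval statement
    return $@ ? "caught: $@" : "area: $area\n";
}

print describe(Square->new(side => 3));
print describe(Blob->new);
```

The second call does not stop the program: the exception raised by the inherited area method is caught and reported.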


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
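To experiment with getopts without retyping command lines, the option list can be placed in a localized @ARGV. This is only a sketch: the parse_options function and the -v option are invented for the example.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a command line given as a list; returns (informat, outformat, verbose).
sub parse_options {
    local @ARGV = @_;              # getopts() reads its options from @ARGV
    my %opts;
    getopts('i:o:v', \%opts)       # i and o take a value, v is a flag
        or die "usage: [-i informat] [-o outformat] [-v]\n";
    return ($opts{i} || 'swiss', $opts{o} || 'fasta', $opts{v} ? 1 : 0);
}

my @parsed = parse_options('-o', 'gcg', '-v');
print "informat=$parsed[0] outformat=$parsed[1] verbose=$parsed[2]\n";
```

A colon after an option letter means that the option expects a value; letters without a colon are simple flags.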

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
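The inheritance and is-a tests above can be reproduced with two tiny classes in core Perl. Feature and Exon are hypothetical stand-ins and have nothing to do with the bioperl classes of similar names.

```perl
use strict;
use warnings;

package Feature;                    # hypothetical base class
sub new { my $class = shift; return bless {}, $class }

package Exon;                       # hypothetical subclass
our @ISA = ('Feature');             # inheritance, as in Bio::Seq

package main;

my $exon = Exon->new;               # new() is inherited from Feature
print "class of exon: ", ref($exon), "\n";
print "the Exon class is-a Feature\n" if Exon->isa('Feature');
print "the exon object is-a Feature\n" if $exon->isa('Feature');
print "the exon object is not a Blob\n" unless $exon->isa('Blob');
```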

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
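A short core-Perl sketch makes the timing visible: the BEGIN block is written after the ordinary statement, yet it runs first, because it is executed as soon as it has been compiled. The @order array is only there to record the order.

```perl
use strict;
use warnings;

our @order;                              # package variable recording the order

push @order, 'main code';                # ordinary statement: runs at run time

BEGIN { push @order, 'BEGIN block' }     # runs as soon as it is compiled

print join(' -> ', @order), "\n";
```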

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
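The three Exporter variables can be seen at work in a single file. In this sketch the SeqUtil package and its two functions are invented; because the package lives in the same file as the main program, import is called explicitly instead of through a use statement.

```perl
use strict;
use warnings;

package SeqUtil;                       # hypothetical module
use Exporter ();
our @ISA         = ('Exporter');
our @EXPORT      = ();                              # nothing by default
our @EXPORT_OK   = qw(revcom gc_count);             # on demand
our %EXPORT_TAGS = (all => [qw(revcom gc_count)]);  # as a named set

# reverse complement of a DNA string
sub revcom {
    my ($dna) = @_;
    (my $rc = reverse $dna) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# number of G and C characters in a DNA string
sub gc_count { my ($dna) = @_; return ($dna =~ tr/GCgc//) }

package main;

# For a module stored in its own file this would be: use SeqUtil qw(:all);
SeqUtil->import(':all');

print join(' ', revcom('AACG'), gc_count('AACG')), "\n";
```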

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.
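The effect of both pragmas can be observed in a few lines of core Perl. The $counter variable is invented for this sketch; the symbolic reference is wrapped in eval so that the refusal can be printed instead of stopping the program.

```perl
use strict;
use warnings;
use vars qw($counter);      # pre-declared package (global) variable

$counter = 1;               # allowed: declared with use vars

# A symbolic reference (a string used as a variable name) is refused at
# run time under use strict; eval lets us report the refusal.
my $name  = 'counter';
my $value = eval { ${$name} };

print defined($value) ? "symbolic reference allowed\n"
                      : "refused: $@";
```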

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
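The same mechanism can be tried without bioperl. In the sketch below, the WordStream class is a hypothetical stand-in for Bio::SeqIO: it hands out words instead of sequence objects and simply records what is printed to it.

```perl
use strict;
use warnings;
use Symbol ();

package WordStream;        # hypothetical class, standing in for Bio::SeqIO

sub new {
    my ($class, @words) = @_;
    return bless { words => [@words], written => [] }, $class;
}

# Return a filehandle tied to this object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie: the object itself handles the filehandle operations.
sub TIEHANDLE { my ($class, $obj) = @_; return $obj }

# Called by <$fh>: one word in scalar context, all remaining words in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{words} } unless wantarray;
    my @all = @{ $self->{words} };
    @{ $self->{words} } = ();
    return @all;
}

# Called by print $fh ...: remember what was written.
sub PRINT { my $self = shift; push @{ $self->{written} }, @_; return 1 }

package main;

my $stream = WordStream->new(qw(alpha beta gamma));
my $fh     = $stream->fh;
my $first  = <$fh>;              # calls READLINE in scalar context
print $fh $first, 'extra';       # calls PRINT
my @rest   = <$fh>;              # calls READLINE in list context

print "first=$first rest=@rest written=@{ $stream->{written} }\n";
```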

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #!/local/bin/perl -w

    # exercise 2.2: UnivConvert via a file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output file name from the input file name
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #!/local/bin/perl -w

    # exercise 2.2: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #!/local/bin/perl -w

    # exercise 2.2: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #!/local/bin/perl -w

    # exercise 2.2: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end
    # of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of
    # ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use these lines instead of the two lines of a)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);           # Expectation value
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # minimum hit length to display (second argument, default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    # -1 means "no upper limit"
    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # database name = first word followed by '|' in the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    # databases for which a hit has already been displayed
    my %done;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    # one result per query sequence
    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;   # only the best (first) hit
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
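The padding done by fill_gaps can also be written with Perl's repetition operator x. The sketch below is plain Perl and needs no bioperl module; the name pad_to_alignment and the sample values are illustrative assumptions, not part of the course code.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub pad_to_alignment {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;                       # gaps before the HSP
    my $right = $length - $left - length($seq);   # gaps after the HSP
    $right = 0 if $right < 0;                     # HSP runs past the end
    return ('-' x $left) . $seq . ('-' x $right);
}

# Hypothetical values: a 4-letter HSP starting at column 3 of 10.
print pad_to_alignment('ACGT', 3, 10), "\n";
```

It produces the same number of leading and trailing gaps as the two loops in fill_gaps, as long as the HSP fits inside the alignment.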

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                # an HSP is a feature: attach it to the query sequence
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    # hits seen in the previous iterations
    my @previous_hitarray = ();

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
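The test for disappeared hits in the solution above uses grep with a pattern, which also succeeds when one hit name is merely contained in another. A hash lookup compares whole names instead. The following is a plain-Perl sketch; the function name disappeared and the sample hit names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the names present in @$previous but absent from @$current.
sub disappeared {
    my ($previous, $current) = @_;              # two array references
    my %still_there = map { $_ => 1 } @$current;
    return grep { !$still_there{$_} } @$previous;
}

# Hypothetical hit names from two successive PSI-BLAST iterations.
my @round1 = ('HIT_A', 'HIT_B', 'HIT_C');
my @round2 = ('HIT_A', 'HIT_C', 'HIT_D');
print "DISAPPEARED hit name: $_\n" for disappeared(\@round1, \@round2);
```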


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    # one alignment per HSP
    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates so that they are relative to the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases and the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # $entry has the form db:id
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (!$id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # close the pipe, then check the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    # a module must return a true value
    1;


Figure 2.4. Bioperl 'Sequence' classes structure

2.3. Features and Location classes

2.3.1. Feature

Features are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features:

• Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]

• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes:

• Bio::SeqFeatureI

• Bio::SeqFeature::Generic

• Bio::SeqFeature::FeaturePair

• Bio::SeqFeature::SimilarityPair

• Bio::SeqFeature::Similarity

• Bio::SeqFeature::Gene

    • Bio::SeqFeature::Gene::ExonI

    • Bio::SeqFeature::Gene::Exon

    • Bio::SeqFeature::Gene::Transcript

    • Bio::SeqFeature::Gene::TranscriptI

    • Bio::SeqFeature::Gene::GeneStructureI

    • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS    1..1000
           /codon=(seq:"cug",aa:Ser)
           /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases); example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range); example: (102.110) is a site within the range 102 to 110, inclusive

• '<' BEFORE

• '>' AFTER
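To get a feel for this notation, the coding can be mimicked in plain Perl. This is only an illustrative sketch: classify_location is a hypothetical helper, it is not how Bio::Location::Fuzzy parses its arguments, and it recognises only the textual forms listed above.

```perl
use strict;
use warnings;

# Map a location string to the name of the coding it uses.
sub classify_location {
    my ($loc) = @_;
    return 'BEFORE'  if $loc =~ /^<\d+$/;              # e.g. <30
    return 'AFTER'   if $loc =~ /^>\d+$/;              # e.g. >90
    return 'BETWEEN' if $loc =~ /^\d+\^\d+$/;          # e.g. 123^124
    return 'WITHIN'  if $loc =~ /^\(\d+\.\d+\)$/;      # e.g. (102.110)
    return 'EXACT'   if $loc =~ /^\d+(?:\.\.\d+)?$/;   # e.g. 90 or 10..40
    return 'UNKNOWN';
}

foreach my $loc ('<30', '90', '123^124', '(102.110)') {
    print "$loc\t", classify_location($loc), "\n";
}
```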

Location Classes:

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line by using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;              # sequence as string
    $seqobj->subseq(10,40);    # sub-sequence as string
    $seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
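The coordinates passed to subseq and trunc are 1-based and inclusive, unlike Perl's own substr, which takes a 0-based offset and a length. The plain-Perl sketch below shows the conversion; subseq_1based and the sample string are assumptions made for illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Return positions $start..$end of $seq, counting from 1, both ends included.
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Positions 3 to 6 of a hypothetical 10-letter sequence.
print subseq_1based('ACGTACGTAC', 3, 6), "\n";
```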

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave ('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                   '-threshold' => '3.5',
                                                   '-desc'      => 'test sigcleave protein seq',
                                                   '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
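The column logic of this example can be tried without an alignment file by working on plain strings. The sketch below is self-contained Perl; drop_gap_columns and the three sample rows are assumptions made for illustration, and a real script would still read and write the alignment through Bio::AlignIO as above.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a set of
# aligned strings of equal length.
sub drop_gap_columns {
    my ($gapchar, @rows) = @_;
    return @rows unless @rows;
    my @keep;   # indices of the columns without any gap
    foreach my $col (0 .. length($rows[0]) - 1) {
        my $column = join '', map { substr($_, $col, 1) } @rows;
        push @keep, $col if index($column, $gapchar) < 0;
    }
    # rebuild each row from the columns that were kept
    return map { my $row = $_; join '', map { substr($row, $_, 1) } @keep } @rows;
}

# Hypothetical aligned rows: columns 2 and 3 contain a gap.
print "$_\n" for drop_gap_columns('-', 'AC-GT', 'A-TGT', 'ACTGA');
```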

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

28

Chapter 4 Analysis

Example 42 StandAloneBlast parsing

use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]

-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo

rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result

while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n

while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast (with _READMETHOD set to "Blast") actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than or equal to a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
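
The hash-marking idea can be tried on plain strings before touching a real report. The following is only a minimal sketch: the hit names are made up for illustration, and this dbname helper is a hypothetical stand-in for the function linked above, assuming that hit names start with a database prefix followed by "|".

```perl
use strict;
use warnings;

# Made-up hit names, in report order (best hit first); not real entries.
my @hit_names = ('sp|X00001|FAKE1', 'sp|X00002|FAKE2',
                 'pir|Y00001|FAKE3', 'sp|X00003|FAKE4');

# Hypothetical helper: the database is the word before the first '|'.
sub dbname {
    my ($name) = @_;
    my ($db) = $name =~ /^(\w+)\|/;
    return $db;
}

my %done;    # databases for which a best hit has already been kept
my @best;
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db};    # this database already has its best hit
    $done{$db} = 1;        # mark the database as done
    push @best, $name;
}
print "$_\n" for @best;
```

Because hits are listed from best to worst, the first hit seen for a database is its best one, so a single pass over the report is enough.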

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                     -format => 'fasta');
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'psiblast',
                                -file   => $ARGV[0]);

    while (my $result = $in->next_result()) {
        foreach my $hit ($result->hits) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.
Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                         -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28
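
To get a feel for what such a split location looks like as text, here is a small sketch in plain Perl that assembles a join(complement(...)) string from exon coordinates. It does not use the bioperl Location classes; the function name location_string and the coordinates are assumptions chosen for illustration only.

```perl
use strict;
use warnings;

# Build an EMBL-style location string from [start, end] pairs.
# $reverse is true when every exon lies on the complementary strand.
sub location_string {
    my ($exons, $reverse) = @_;
    my @parts = map {
        my $range = "$_->[0]..$_->[1]";
        $reverse ? "complement($range)" : $range;
    } sort { $a->[0] <=> $b->[0] } @$exons;
    return @parts > 1 ? 'join(' . join(',', @parts) . ')' : $parts[0];
}

# Illustrative exon coordinates, given in arbitrary order.
my @exons = ([284, 358], [100, 171], [457, 492]);
my $location = location_string(\@exons, 1);
print "$location\n";
```

In a bioperl solution the same information would rather be held in a split location object built with add_sub_Location; the string form is only what ends up in the FT line.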

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you to use the modules [use] or to gain a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                       -parse     => 1,
                                       -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

• reference on an anonymous array:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to the following, with a variable:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
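
The three uses of references listed above (a hash as input, a function as a callback, an array as an output parameter) can be combined in one self-contained sketch. Nothing here is bioperl code: run_filter, my_filter and the parameter names are hypothetical and only mimic the calling style shown above.

```perl
use strict;
use warnings;

# Receives a hash reference (input parameters), a code reference
# (filter callback) and an array reference used as an output parameter.
sub run_filter {
    my ($params, $filter, $results) = @_;
    foreach my $seq (@{ $params->{-seqs} }) {
        push @$results, $seq if $filter->($seq);
    }
    return scalar @$results;
}

# Keep only strings of at least 4 characters.
sub my_filter {
    my ($seq) = @_;
    return length($seq) >= 4;
}

my @seqs     = ('ACGT', 'AC', 'GATTACA');
my %runParam = (-prog => 'blastp', -seqs => \@seqs);
my @kept;

my $n = run_filter(\%runParam, \&my_filter, \@kept);
print "$n kept: @kept\n";
print join(' ', ref(\%runParam), ref(\&my_filter), ref(\@kept)), "\n";
```

The last line shows what ref() reports for each kind of reference (HASH, CODE, ARRAY).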

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $line;         # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');      # make a tied filehandle
    $sequence = <$in>;                          # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '>f');
    print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

  typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
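
As a plain-Perl illustration of passing filehandles around, the sketch below writes through a typeglob reference and reads back through a lexical filehandle. The function write_lines and the scratch file name are assumptions made for this example.

```perl
use strict;
use warnings;

# Write lines to whatever filehandle the caller passes in.
sub write_lines {
    my ($fh, @lines) = @_;
    print {$fh} "$_\n" for @lines;
}

write_lines(\*STDOUT, 'to standard output');      # typeglob reference

my $file = "filehandle_demo.$$.txt";              # hypothetical scratch file
open(OUTFILE, '>', $file) || die "cannot open $file: $!";
write_lines(\*OUTFILE, 'first', 'second');        # bareword handle, passed by reference
close(OUTFILE);

open(my $in, '<', $file) || die "cannot open $file: $!";   # lexical filehandle
my @lines = <$in>;
close($in);
unlink $file;

print "read back ", scalar(@lines), " lines\n";
```

A lexical filehandle such as $in is an ordinary scalar, so it can be passed to a function directly, without the \* notation.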

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
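
The two ideas above (catching an exception with eval, and using an exception to mark an abstract method) can be tried without bioperl. In this sketch the classes Shape, Square and Blob are hypothetical, and a plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

package Shape;                       # hypothetical "interface" class
sub new  { my ($class) = @_; return bless {}, $class; }
sub area {
    my ($self) = @_;
    # Abstract method: reaching this code means the subclass forgot it.
    die ref($self) . ": implementing class did not provide area()\n";
}

package Square;                      # implements the abstract method
our @ISA = ('Shape');
sub area { return 4; }

package Blob;                        # does not implement it
our @ISA = ('Shape');

package main;

my @messages;
foreach my $class ('Square', 'Blob') {
    my $area = eval { $class->new->area };    # note the ";" closing the eval statement
    if ($@) {
        push @messages, "Caught exception: $@";
    } else {
        push @messages, "$class area: $area\n";
    }
}
print @messages;
```

Calling area on a Blob falls back to the method inherited from Shape, which dies; eval turns that into an error message in $@ instead of stopping the program.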

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
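
To see what getopts leaves in the hash and in @ARGV, the following sketch simulates a command line inside the script. The option letters and the file name seqs.embl are arbitrary choices for this example; a real script would of course not overwrite @ARGV.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line: -i takes a value, -h is a simple flag.
@ARGV = ('-i', 'embl', '-h', 'seqs.embl');

my %opts;
getopts('hi:o:', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';    # given on the command line
my $outform = $opts{o} || 'fasta';    # not given: default value
my $help    = $opts{h} ? 1 : 0;

print "help requested\n" if $help;
print "in=$inform out=$outform remaining=@ARGV\n";
```

A colon after a letter in the specification string means that the option expects a value; whatever is not an option stays in @ARGV.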

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
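
These reminders can be checked on a toy hierarchy. The classes Seq and RichSeq below are hypothetical miniatures, not the bioperl classes of similar names, and the identifiers are invented.

```perl
use strict;
use warnings;

package Seq;                          # hypothetical base class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub id  { my ($self) = @_; return $self->{-id}; }

package RichSeq;                      # hypothetical subclass
our @ISA = qw(Seq);                   # inheritance
sub accession { my ($self) = @_; return $self->{-accession}; }

package main;

# new() is inherited from Seq but blesses into the calling class.
my $seq = RichSeq->new(-id => 'SEQ1', -accession => 'X00001');

print "class of seq: ", ref($seq), "\n";
print "object is a Seq\n"        if $seq->isa('Seq');
print "class RichSeq is a Seq\n" if RichSeq->isa('Seq');
print "id: ", $seq->id, "\n";
```

isa works both on an object and on a class name, whereas ref only makes sense on an object (it returns an empty string for a plain string).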

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
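
The order of execution is easy to observe in a small script. In this sketch the variable names and the value 'widest' are placeholders; the point is only that the BEGIN block runs before ordinary statements, even those written above it.

```perl
use strict;
use warnings;

our @order;
our $coord_policy;

push @order, 'run time';      # ordinary statement: executed at run time

BEGIN {
    # Executed as soon as this block has been compiled,
    # i.e. before any ordinary statement of the file runs.
    $coord_policy = 'widest';          # placeholder default value
    push @order, 'BEGIN';
}

print join(' -> ', @order), "\n";
print "default policy: $coord_policy\n";
```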

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods
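
The three Exporter variables can be exercised in a single file. This is only a sketch: the module My::SeqUtils, its functions and its tag :all are invented for the example, and the module is defined inline, so the import is triggered explicitly in a BEGIN block instead of with a use statement.

```perl
use strict;
use warnings;

package My::SeqUtils;                 # hypothetical module, defined inline
use Exporter 'import';
our (@EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Default);
BEGIN {
    @EXPORT      = qw(gc_count);                        # exported by default
    @EXPORT_OK   = qw(at_count $Default);               # exported on demand
    %EXPORT_TAGS = (all => [qw(gc_count at_count $Default)]);
}
$Default = 'fasta';

# Count G and C letters (case-insensitive).
sub gc_count { my ($s) = @_; my $n = () = $s =~ /[GC]/gi; return $n; }
# Count A and T letters (case-insensitive).
sub at_count { my ($s) = @_; my $n = () = $s =~ /[AT]/gi; return $n; }

package main;

# With the module in its own file, this would be: use My::SeqUtils qw(:all);
BEGIN { My::SeqUtils->import(':all') }

print "GC: ", gc_count('GATTACA'), " AT: ", at_count('GATTACA'), "\n";
print "default format: $Default\n";
```

Every symbol listed under a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses to export it.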

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
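
A short sketch of what these two pragmas change in practice. The hash %valid_type and its content are used here purely as an illustration; the variable names $symbolic_ok and $count are assumptions of this example.

```perl
use strict;
use warnings;
use vars qw(%valid_type);        # pre-declares the package global %main::valid_type

%valid_type = map { $_ => 1 } qw(dna rna protein);

my $name = 'valid_type';

# Under "use strict", using a string as a reference is a run-time error,
# which eval catches here.
my $symbolic_ok = eval { my @k = keys %$name; 1 } ? 1 : 0;
print "symbolic reference allowed under strict: $symbolic_ok\n";

# Without strict refs, the same expression reaches %main::valid_type.
my $count = do { no strict 'refs'; scalar keys %$name };
print "types found without strict refs: $count\n";
```

Without the use vars line, the assignment to %valid_type would itself be rejected at compile time by use strict, since the variable is neither declared with my nor fully qualified.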

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;              # make a tied filehandle
    $sequence = <$fh>;           # read a sequence object (calls READLINE)
    print $fh $sequence;         # write a sequence object (calls PRINT)
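
The same mechanism can be reproduced with a few lines of plain Perl. In the sketch below, LineStore and LineStore::Fh are hypothetical classes that keep lines in memory; they only imitate the structure of the Bio::SeqIO code shown above.

```perl
use strict;
use warnings;

package LineStore;                    # hypothetical class holding lines in memory
use Symbol ();

sub new {
    my ($class, @lines) = @_;
    return bless { in => [@lines], out => [] }, $class;
}

# Return a filehandle tied to LineStore::Fh.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, 'LineStore::Fh', $self;
    return $s;
}

package LineStore::Fh;

sub TIEHANDLE {
    my ($class, $store) = @_;
    return bless { store => $store }, $class;
}

# Called by <$fh>: one line in scalar context, all lines in list context.
sub READLINE {
    my $self = shift;
    my $in = $self->{store}{in};
    return shift @$in unless wantarray;
    return splice(@$in);
}

# Called by print $fh ...: records what is printed.
sub PRINT {
    my $self = shift;
    push @{ $self->{store}{out} }, @_;
    return 1;
}

package main;

my $store = LineStore->new("first\n", "second\n");
my $fh = $store->fh;                       # make a tied filehandle
my $line = <$fh>;                          # calls READLINE
print $fh "written through the tie\n";     # calls PRINT

print "read: $line";
print "stored: $store->{out}[0]";
```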

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);
    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);
    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "   -i informat  -- infile format (default 'swiss')\n";
        print "   -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "   -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split(/\n/, $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);    # Expectation value
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (!ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # minimum hit length to display (default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;    # -1 means no upper bound

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
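The position test used in this solution can be tried on plain numbers, without a Blast report. The following is only a sketch: the helper name in_window is hypothetical and does not come from the course or from bioperl.

```perl
use strict;
use warnings;

# True (1) when the range [$start, $end] starts at or after $min_start and,
# unless $max_end is -1 (no upper bound), ends at or before $max_end.
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    return ($start >= $min_start && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

# An HSP covering 10..50, tested against three windows.
print "no upper bound: ", in_window(10, 50, 1, -1), "\n";
print "starts too early: ", in_window(10, 50, 20, -1), "\n";
print "ends too late: ", in_window(10, 50, 1, 40), "\n";
```

Using -1 as a "no limit" marker keeps the command line simple: the third argument can just be left out.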

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank name (the word before the first '|') from a hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;    # databanks for which the best hit has already been printed
while (my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
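The "first hit per databank" logic does not depend on bioperl, so it can be checked on a plain list of names. This is a minimal sketch: the hit names below are made up for illustration, and best_hit_per_db is a hypothetical helper, not part of the course code.

```perl
use strict;
use warnings;

# Extract the databank prefix (the word before the first '|') from a hit name.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first hit seen for each databank. Blast lists hits best first,
# so the first one seen is the best one.
sub best_hit_per_db {
    my @hit_names = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $db = dbname($name);
        next if !defined $db || $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Hypothetical hit names, in report order.
my @hits = ('sp|P0A000|AAAA_ECOLI', 'pir|S12345|S12345', 'sp|P0B000|BBBB_ECOLI');
print "$_\n" foreach best_hit_per_db(@hits);
```

The hash plays the role of a "seen" set: the post-increment returns 0 the first time a databank is met, so only that hit is kept.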

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                              -format => 'fasta');
push(@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                              -format => 'fasta');
push(@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

# blastall accepts a reference to an array of sequences
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only the first (best) hit of each query
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    my $i = 0;
    while (my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
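To see what the gap padding does, it can help to run it on a toy case. The sketch below uses the same padding rule as the fill_gaps function of the solution, rewritten with the x repetition operator; the sequences are invented for illustration.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# and the result is $length columns long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - ($start + length($seq)) + 1;
    $lead  = 0 if $lead < 0;
    $trail = 0 if $trail < 0;
    return ('-' x $lead) . $seq . ('-' x $trail);
}

# A 10-column "contig" and a 3-letter match starting at column 3.
my $contig = 'TTACGTTTTT';
print $contig, "\n";
print fill_gaps('ACG', 3, length $contig), "\n";
```

Every row then has the same length as the contig, which is what an alignment object expects of its sequences.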

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the best hit of each database
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

my @previous_hitarray = ();    # hits seen in earlier iterations

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit (@{$hitarray_ref}) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @{$oldhitarray_ref}) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push(@previous_hitarray, @{$oldhitarray_ref});
    push(@previous_hitarray, @{$hitarray_ref});
}
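The bookkeeping between two iterations amounts to a set difference on hit names. The following sketch shows that step alone on plain arrays; the hit names are invented, the helper name disappeared is hypothetical, and it compares names exactly (through a hash) where the solution uses a regular expression match.

```perl
use strict;
use warnings;

# Names present in @$previous but absent from @$current.
sub disappeared {
    my ($previous, $current) = @_;
    my %still_there = map { $_ => 1 } @$current;
    return grep { !$still_there{$_} } @$previous;
}

# Hypothetical hit names for two successive rounds.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_A HIT_C HIT_D);
print "DISAPPEARED hit name: $_\n" foreach disappeared(\@round1, \@round2);
```

A hash lookup avoids rescanning the whole list for every name, which matters once a report has many hits.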


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                  -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                  -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);

    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database
#  - 1 entry per gene with genomic DNA
#  + 1 feature for the whole CDS
#  + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                            -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift the exon into the coordinates of the new entry
        $loc->start($loc->start - $start);
        $loc->end($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                   -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ($start > $end) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                      -id      => $ids[1],
                                      -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
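The "+/- delta" window around a gene is plain arithmetic and can be checked without a Genscan report. The sketch below is an illustration only: it assumes 1-based genomic coordinates (as used by subseq), the function names window and to_entry_coord are hypothetical, and the clamping and offset conventions may differ from those of the script above.

```perl
use strict;
use warnings;

# Window of +/- $delta around [$first_start, $last_end],
# clamped to the valid range 1..$seq_length.
sub window {
    my ($first_start, $last_end, $delta, $seq_length) = @_;
    my $start = $first_start - $delta;
    $start = 1 if $start < 1;
    my $end = $last_end + $delta;
    $end = $seq_length if $end > $seq_length;
    return ($start, $end);
}

# Shift a genomic coordinate into the coordinate system of the extracted window.
sub to_entry_coord {
    my ($pos, $window_start) = @_;
    return $pos - $window_start + 1;
}

# A gene spanning 250..900 on a 950 bp sequence, with delta = 100.
my ($start, $end) = window(250, 900, 100, 950);
print "window: $start..$end\n";
print "first exon starts at ", to_entry_coord(250, $start), " in the new entry\n";
```

Computing the window in genomic coordinates first, and shifting features afterwards, keeps the two coordinate systems from being mixed.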


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # database name => sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # entries are given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                             -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;

Page 19: Bio Perl

Chapter 2 Sequences

They are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing program output. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a subclass of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).

Documentation on features

• Features table: official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]
• bioperl tutorial [http://bioperl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]

Features Classes

• Bio::SeqFeatureI
• Bio::SeqFeature::Generic
• Bio::SeqFeature::FeaturePair
• Bio::SeqFeature::SimilarityPair
• Bio::SeqFeature::Similarity
• Bio::SeqFeature::Gene
    • Bio::SeqFeature::Gene::ExonI
    • Bio::SeqFeature::Gene::Exon
    • Bio::SeqFeature::Gene::Transcript
    • Bio::SeqFeature::Gene::TranscriptI
    • Bio::SeqFeature::Gene::GeneStructureI
    • Bio::SeqFeature::Gene::GeneStructure

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code with, for instance, gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features of an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

CDS             1..1000
                /codon=(seq:"cug",aa:Ser)
                /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)
• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
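One way to picture the tag system is as a table that maps each tag name to a list of values, since a tag such as codon may occur several times in one feature. The sketch below builds such a table from EMBL-style qualifier lines in plain Perl. It is an illustration only: parse_qualifiers is a hypothetical helper, not a bioperl method, and it says nothing about how Bio::SeqFeature::Generic stores tags internally.

```perl
use strict;
use warnings;

# Collect "/tag=value" qualifier lines into a hash of arrays
# (one list of values per tag name).
sub parse_qualifiers {
    my @lines = @_;
    my %tags;
    foreach my $line (@lines) {
        next unless $line =~ m{^\s*/(\w+)=(.*?)\s*$};
        push @{ $tags{$1} }, $2;
    }
    return %tags;
}

my %tags = parse_qualifiers('/codon=(seq:"cug",aa:Ser)',
                            '/codon=(seq:"tga",aa:Trp)');
foreach my $tag (sort keys %tags) {
    print "$tag: $_\n" foreach @{ $tags{$tag} };
}
```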

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use to specify the begin or the end: fuzzy locations, ranges, lists (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                             -end      => 90,
                                             -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..' EXACT
• '^' BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177.
• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102 to 110 range, inclusive.
• '<' BEFORE
• '>' AFTER
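The position notations above can be told apart with a few regular expressions. The following is a minimal sketch in plain Perl: position_type is a hypothetical helper written for this illustration, not the parser used by Bio::Location::Fuzzy, and it handles a single position string only.

```perl
use strict;
use warnings;

# Classify one EMBL-style position string.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;             # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;             # >200
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;         # 123^124
    return 'WITHIN'  if $pos =~ /^\(\d+\.\d+\)$/;     # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;              # 90
    return 'UNKNOWN';
}

print "$_ => ", position_type($_), "\n"
    foreach ('<30', '90', '123^124', '(102.110)', '>200');
```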

Location Classes

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
    • interface: Bio::Location::CoordinatePolicyI
    • default: Bio::Location::WidestCoordPolicy
    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;               # sequence as string
$seqobj->subseq(10,40);     # sub-sequence as string
$seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AATAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',    # creates a new one
                                         -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',    # not a Bio::Seq object
                                              '-threshold' => '3.5',
                                              '-desc'      => 'test sigcleave protein seq',
                                              '-type'      => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;
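To get a feeling for what two of these tools compute, here is a plain Perl sketch of residue counting and of cutting at the EcoRV site GAT^ATC. It works on strings, not on Bio::Seq objects; the function names and the sample sequence are invented for this illustration and are not the bioperl implementations.

```perl
use strict;
use warnings;

# Count each residue of a sequence string.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ foreach split //, uc $seq;
    return %count;
}

# Cut a DNA string at every GAT^ATC site (blunt cut between GAT and ATC).
sub cut_ecorv {
    my ($seq) = @_;
    return split /(?<=GAT)(?=ATC)/, uc $seq;
}

my $dna = 'AAGATATCTTGATATCGG';
my %n = count_monomers($dna);
print join(' ', map { "$_=$n{$_}" } sort keys %n), "\n";
print join(' | ', cut_ecorv($dna)), "\n";
```

The split pattern is zero-width (a lookbehind followed by a lookahead), so no base is consumed at the cut position.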

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all columns as strings

my @aln_cols = ();

foreach my $seq ($aln->each_alphabetically()) {
    my $colnr = 0;
    foreach my $chr (split("", $seq->seq())) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col (@aln_cols) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col (@no_gap_cols) {
    my $colnr = 0;
    foreach my $chr (split "", $col) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ($aln->each_alphabetically()) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
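The transpose, filter, transpose-back idea of this example can be tried on plain strings, without an alignment file. The sketch below is only an illustration of that idea: drop_gap_columns is a hypothetical helper, the rows are invented, and all rows are assumed to have the same length.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from equal-length rows.
sub drop_gap_columns {
    my ($gapchar, @rows) = @_;

    # rows -> columns
    my @cols;
    foreach my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ foreach split //, $row;
    }

    # keep the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ foreach split //, $col;
    }
    return @out;
}

print "$_\n" foreach drop_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
```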

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value, 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
    • print its name and database
    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
        • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, whose documentation lists the methods available on results.

Exercise 4.5. Display Blast hits

Display the hit accession and the HSP length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than or equal to a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname($hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.
• You may first display this as a Clustalw alignment (one sequence per HSP).
• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPlite

In the following example (notice that we have removed the _READMETHOD parameter), $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}


Exercise 4.17 Print information from a BPlite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2 BPlite classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query = <$query_in>;

$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    while ( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\t\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\t\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4 Genscan classes structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2 Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1 Database classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
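The reference forms listed above can be tried without bioperl. The following is a minimal sketch in plain Perl; the names describe_run, apply_filter, keep_long and %run_param are made up for the illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical parameters, modelled on the %runParam hash of this section.
my %run_param = ( -prog => 'blastp', -database => 'swissprot' );

# A function receiving a hash reference (called with \%run_param).
sub describe_run {
    my ($param) = @_;
    return join(' ', map { "$_=$param->{$_}" } sort keys %$param);
}

# A function receiving a code reference, an input array reference and
# an array reference used as an output parameter.
sub apply_filter {
    my ($filter, $input, $output) = @_;
    push @$output, grep { $filter->($_) } @$input;
    return scalar @$output;
}

sub keep_long { return length($_[0]) > 3 }

my @kept;
apply_filter(\&keep_long, [ 'ACGT', 'AC', 'ACGTT' ], \@kept);

print describe_run(\%run_param), "\n";
print "kept: @kept\n";
print ref(\%run_param), " ", ref(\@kept), " ", ref(\&keep_long), "\n";
```

The anonymous array [ 'ACGT', 'AC', 'ACGTT' ] plays the same role as [ $seq ] above, and \@kept the same role as \@blast_objs.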

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;                # read a line
print OUTFILE $line;             # write a line

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '>f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
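As an illustration of passing filehandles around, here is a small sketch in plain Perl. The helper write_record is hypothetical; it only shows that a bareword filehandle goes through a typeglob reference, whereas a lexical filehandle can be passed as it is.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes one record to whatever filehandle it is given.
sub write_record {
    my ($fh, $text) = @_;
    print $fh $text, "\n";
}

# A bareword filehandle has to be passed as a typeglob reference.
write_record(\*STDOUT, 'to standard output');

# A lexical filehandle is already a reference and can be passed directly.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
write_record($tmp_fh, 'to a temporary file');
close($tmp_fh) || die "cannot close $tmp_name: $!";

# Read the record back through a bareword filehandle.
open(INFILE, '<', $tmp_name) || die "cannot open $tmp_name: $!";
my $line = <INFILE>;
close(INFILE);
chomp $line;
print "read back: $line\n";
```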

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the Perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ( $@ ) {
    print "Caught exception\n";
} else {
    print "no exception\n";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
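The die/eval pairing and the "abstract method" idiom can be reproduced in a few lines of plain Perl. This is only a sketch: the classes My::SeqI, My::GoodSeq and My::BadSeq and the helper try_seq are hypothetical, and a bare die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and dies when
# a subclass has not overridden it.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide seq\n";
}

# Subclass that implements seq().
package My::GoodSeq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq} }

# Subclass that forgets to implement seq().
package My::BadSeq;
our @ISA = ('My::SeqI');

package main;

# Run the call inside eval and report whether it raised an exception.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };
    return $@ ? "caught: $@" : "ok: $value\n";
}

print try_seq(My::GoodSeq->new(seq => 'ACGT'));
print try_seq(My::BadSeq->new());
```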

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
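To experiment with option strings without typing command lines, getopts can be run on a local copy of @ARGV. The sketch below assumes a made-up option set ('i:o:h') and a hypothetical helper parse_args; a letter followed by a colon takes a value, a letter alone is a flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse an argument list with the option string 'i:o:h':
# -i and -o take a value, -h is a boolean flag.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;            # getopts() works on @ARGV
    my %opts;
    getopts('i:o:h', \%opts) || die "bad options\n";
    return {
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => exists $opts{h} ? 1 : 0,
        rest    => [ @ARGV ],       # what is left after the options
    };
}

my $conf = parse_args('-i', 'embl', 'seqs.dat');
print "in=$conf->{inform} out=$conf->{outform} rest=@{ $conf->{rest} }\n";
```

Options that are not given fall back to their default through the || operator, as in the examples of this section.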

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
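The same three operations (inheritance through @ISA, the is-a test, and ref to find the class) can be checked on toy classes. My::Root and My::Seq below are hypothetical; only the mechanism mirrors the bioperl lines above.

```perl
use strict;
use warnings;

# Hypothetical base class providing the constructor.
package My::Root;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

# Hypothetical subclass: inherits new() from My::Root through @ISA.
package My::Seq;
our @ISA = ('My::Root');
sub seq { my ($self) = @_; return $self->{seq} }

package main;

my $seq = My::Seq->new(seq => 'ACGT');

print "class of seq: ", ref($seq), "\n";
print "the object is-a My::Root\n" if $seq->isa('My::Root');
print "the class is-a My::Root\n"  if My::Seq->isa('My::Root');
print "sequence: ", $seq->seq, "\n";
```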

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, where it is used to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
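The timing of a BEGIN block can be observed with a short sketch. The variable names and the 'widest' value are placeholders chosen for the illustration; the point is only that the block runs at compile time, before any run time statement of the file.

```perl
use strict;
use warnings;

our @order;            # records the order in which things happen
our $default_policy;   # hypothetical package-level default

# A run time statement, written before the BEGIN block.
push @order, 'first run time statement';

BEGIN {
    # Executed as soon as this block has been compiled, that is
    # before the push statement above is run.
    push @order, 'BEGIN block';
    $default_policy = 'widest';
}

sub policy { return $default_policy }

print join(' -> ', @order), "\n";
print "policy: ", policy(), "\n";
```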

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
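The three Exporter variables can be tried in a single file. In this sketch the module My::Tools, its functions and its $Default variable are all hypothetical; the module is defined inline in a BEGIN block, and the explicit import call stands in for the use statement one would write for a module stored in its own file.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the sketch fits in one file.
BEGIN {
    package My::Tools;
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcom');                 # exported by default
    our @EXPORT_OK   = ('gc_count', '$Default');   # imported on demand
    our %EXPORT_TAGS = ( obj => [ '$Default' ] );  # a named set of symbols
    our $Default     = 'dna';

    # Reverse complement of a DNA string.
    sub revcom {
        my ($s) = @_;
        (my $r = reverse $s) =~ tr/ACGTacgt/TGCAtgca/;
        return $r;
    }

    # Number of G and C characters in a string.
    sub gc_count { my ($s) = @_; return ($s =~ tr/GCgc//) }
}

# Stands in for: use My::Tools qw(revcom gc_count :obj);
BEGIN { My::Tools->import(qw(revcom gc_count :obj)) }

print revcom('AACG'), " ", gc_count('AACG'), " $Default\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is declared in both places.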

6.3.2 Compiler instructions

• use strict: no symbolic references, no undeclared global variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;           # make a tied filehandle
$sequence = <$fh>;        # read a sequence object (calls READLINE)
print $fh $sequence;      # write a sequence object (calls PRINT)
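A tied filehandle can be built in plain Perl along the same lines. The class My::ListFh below is hypothetical: it reads from and writes to in-memory lists instead of a sequence file, and it spells the tie call with *$s, the usual form for tying a handle.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical class: a tied filehandle backed by in-memory lists.
package My::ListFh;

sub new {
    my ($class, @items) = @_;
    return bless { items => [ @items ], written => [] }, $class;
}

# Build a fresh glob and tie it to this class.
sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie *$s, $class, $self;
    return $s;
}

# Called by tie: wraps the original object.
sub TIEHANDLE { my ($class, $obj) = @_; return bless { obj => $obj }, $class }

# Called by <$fh>: one item in scalar context, all of them in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{obj}{items} } unless wantarray;
    return splice @{ $self->{obj}{items} };
}

# Called by print $fh ...: stores what is printed.
sub PRINT {
    my $self = shift;
    push @{ $self->{obj}{written} }, @_;
    return 1;
}

package main;

my $obj   = My::ListFh->new('seq1', 'seq2');
my $fh    = $obj->fh;
my $first = <$fh>;             # calls READLINE
print $fh "copy of $first";    # calls PRINT
print "read: $first, written: @{ $obj->{written} }\n";
```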

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;
use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= "." . $outform;
}

my $in = Bio::SeqIO->newFh ( -file => $infile, -format => $inform );
my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;
use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN, -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile, -format => $inform );
my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN, -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0;    # false
    }
    return 1;        # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blastout');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while ( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while ( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while ( my $hit = $result->next_hit()) {
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while ( my $hit = $result->next_hit()) {
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
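To see what the fill_gaps padding does without running Blast, here is a small stand-alone sketch. It uses the x operator instead of the two loops but is meant to produce the same padding; the contig and the fragment are made-up strings, and $start is assumed to be 1-based as in the solution.

```perl
use strict;
use warnings;

# Same padding idea as fill_gaps in the solution, written with the x operator.
# $start is 1-based; $length is the width of the reference sequence.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $contig = 'ACGTACGTAC';                        # made-up 10-base reference
my $padded = fill_gaps('GTA', 3, length $contig); # fragment placed at position 3
print "$contig\n$padded\n";
```

The padded string has the same length as the contig, which is what lets every HSP be added as one row of the same alignment.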

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @$oldhitarray_ref);
        push(@previous_hitarray, @$hitarray_ref);
    }
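The bookkeeping of new and disappeared hits can be tried out on plain lists of names. The sketch below is a variant, not the solution's method: it uses hash lookups and compares each round only with the round just before it, whereas the solution accumulates names in a single list. The round contents and the helper name compare_rounds are made up.

```perl
use strict;
use warnings;

# Made-up hit names for three PSI-BLAST style rounds.
my @rounds = (
    [qw(P1 P2 P3)],
    [qw(P1 P3 P4)],
    [qw(P3 P4 P5)],
);

# Compare two lists of names: returns (\@new, \@gone).
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new  = grep { !$in_previous{$_} } @$current;
    my @gone = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@gone);
}

my $previous = [];
for my $i (0 .. $#rounds) {
    my ($new, $gone) = compare_rounds($previous, $rounds[$i]);
    print "iteration: ", $i + 1, "\n";
    print "  NEW hit name: $_\n"         for @$new;
    print "  DISAPPEARED hit name: $_\n" for @$gone;
    $previous = $rounds[$i];
}
```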

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
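If you wonder what an index over a flat file buys you, the following stand-alone sketch shows the idea in miniature: remember the byte offset of each entry once, then jump straight to it when the entry is requested. It is only a conceptual illustration with made-up records and helper names (make_index, fetch); it is not how Bio::Index::EMBL is implemented.

```perl
use strict;
use warnings;

# Made-up flat file with two EMBL-like records.
my $flatfile = <<'END';
ID   GENSCAN_predicted_CDS_1
SQ   acgtacgt
//
ID   GENSCAN_predicted_CDS_3
SQ   ttgacc
//
END

# Build the index: entry name => byte offset of its ID line.
sub make_index {
    my ($fh) = @_;
    my %offset;
    seek($fh, 0, 0);
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /^ID\s+(\S+)/;
    }
    return \%offset;
}

# Fetch one record by jumping to its offset and reading up to "//".
sub fetch {
    my ($fh, $index, $id) = @_;
    return undef unless exists $index->{$id};
    seek($fh, $index->{$id}, 0);
    my $record = '';
    while (my $line = <$fh>) {
        $record .= $line;
        last if $line =~ m{^//};
    }
    return $record;
}

open(my $fh, '<', \$flatfile) or die "cannot open in-memory file: $!";
my $index = make_index($fh);
for my $id (qw(GENSCAN_predicted_CDS_3 GENSCAN_predicted_CDS_9)) {
    my $record = fetch($fh, $index, $id);
    if (defined $record) { print $record }
    else                 { print STDERR "no entry with id $id in this banque\n" }
}
```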

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #                           + 1 feature for the whole CDS
    #                           + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                                -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
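The coordinate arithmetic in this solution (a window of +/- delta around the exons, clipped to the submitted sequence, with exon positions re-expressed relative to that window) is easy to get wrong, so here is a stand-alone sketch of one way to do it. It is a variant, not the solution's exact arithmetic: it clips the window start to 1 and keeps 1-based coordinates throughout. The exon coordinates and the helper name window_around are made up.

```perl
use strict;
use warnings;

# Compute a window of +/- $delta around a set of exons, clipped to the
# sequence, and express exon coordinates relative to that window (1-based).
sub window_around {
    my ($exons, $delta, $seq_length) = @_;
    my ($min) = sort { $a <=> $b } map { $_->[0] } @$exons;
    my ($max) = sort { $b <=> $a } map { $_->[1] } @$exons;
    my $start = $min - $delta;
    $start = 1 if $start < 1;
    my $end = $max + $delta;
    $end = $seq_length if $end > $seq_length;
    my @shifted = map { [ $_->[0] - $start + 1, $_->[1] - $start + 1 ] } @$exons;
    return ($start, $end, \@shifted);
}

my @exons = ([250, 300], [420, 480]);    # made-up 1-based exon coordinates
my ($start, $end, $shifted) = window_around(\@exons, 100, 520);
print "window: $start-$end\n";
print "exon: $_->[0]-$_->[1]\n" for @$shifted;
```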

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;
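The entry handling of this module (split a db:id string, check that the database is known, pick the sequence format) can be tested on its own, without golden or bioperl. The sketch below assumes entries are written as db:id; the function name parse_entry is made up, and the table simply repeats the databases listed in the module.

```perl
use strict;
use warnings;

# Same table of databases and formats as in the module.
my %format_of = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split "db:id" and return (db, id, format); dies with the same kind of
# messages the module throws.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $format_of{$db};
    return ($db, $id, $format_of{$db});
}

for my $entry ('gb:AB042770', 'AB042770', 'pdb:1abc') {
    my @parsed = eval { parse_entry($entry) };
    if (@parsed) { print "$entry => db=$parsed[0] id=$parsed[1] format=$parsed[2]\n" }
    else         { print "$entry => error: $@" }
}
```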


Chapter 2 Sequences

Figure 2.5. Features Classes structure

2.3.2. Code reading: extracting CDS

Exercise 2.6. extractcds

code [lecture_code/extractcds] (download [ftp://ftp.pasteur.fr/pub/GenSoft/unix/nucleic_acid/extractcds])

Fetch an EMBL or Genbank entry and try this code on it, for instance gb:AB042770 (seq2.gb [data/seq2.gb]).

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between the features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3. Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also lets you create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)

• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
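To make the idea of primary and secondary tags concrete, here is a toy data structure in plain Perl: a feature is a hash with a primary tag and a table mapping each secondary tag to a list of values. The helper names add_tag_value and tag_values are chosen for this sketch; it only mimics the idea and is not the Bio::SeqFeature::Generic implementation.

```perl
use strict;
use warnings;

# Toy feature: a primary tag plus a hash of tag => list of values.
my %feature = (
    primary_tag => 'CDS',
    start       => 1,
    end         => 1000,
    tags        => {},
);

# Append one value to a (possibly new) secondary tag.
sub add_tag_value {
    my ($feat, $tag, $value) = @_;
    push @{ $feat->{tags}{$tag} }, $value;
}

# Return all values recorded for a tag (empty list if the tag is unknown).
sub tag_values {
    my ($feat, $tag) = @_;
    return @{ $feat->{tags}{$tag} || [] };
}

add_tag_value(\%feature, 'codon',  '(seq:"cug",aa:Ser)');
add_tag_value(\%feature, 'codon',  '(seq:"tga",aa:Trp)');
add_tag_value(\%feature, 'source', 'genscan');   # a new, user-defined qualifier

for my $tag (sort keys %{ $feature{tags} }) {
    print "/$tag=$_\n" for tag_values(\%feature, $tag);
}
```

Because a tag maps to a list, the same qualifier can appear several times on one feature, as codon does here.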

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4. Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location that begins before 30 and ends at 90. Coding:

• '..' EXACT

• '^' BETWEEN (a site between 2 bases); example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.' WITHIN (a base within a range); example: (102.110) is a site within the 102 to 110 range, inclusive

• '<' BEFORE

• '>' AFTER

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies for fuzzy location start and end determination:

  • interface: Bio::Location::CoordinatePolicyI

  • default: Bio::Location::WidestCoordPolicy

  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy
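As a quick way to get familiar with the codes listed above, the following stand-alone sketch classifies a few position strings written in the EMBL feature-table style. The function name position_type is made up, and the regular expressions are a plain illustration, not the parser used by Bio::Location::Fuzzy.

```perl
use strict;
use warnings;

# Map one position string onto the fuzzy codes (BEFORE, AFTER, BETWEEN,
# WITHIN, EXACT). Anything else is reported as UNKNOWN.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;
    return 'AFTER'   if $pos =~ /^>\d+$/;
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;
    return 'EXACT'   if $pos =~ /^\d+$/;
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```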

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line by using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrows for domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10, 40);    # sub-sequence as string
    $seqobj->trunc(10, 100);    # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);
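To get a feeling for what monomer and word counts are, here is a stand-alone sketch in plain Perl. The function names reuse count_monomers and count_words for readability, but these are local helpers written for this example: the word count here uses overlapping windows, which may differ from how Bio::Tools::SeqWords counts, and the sequence is made up.

```perl
use strict;
use warnings;

# Count single residues in a sequence string.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Count overlapping words of a given length.
sub count_words {
    my ($seq, $word_length) = @_;
    my %count;
    for my $i (0 .. length($seq) - $word_length) {
        $count{ uc substr($seq, $i, $word_length) }++;
    }
    return \%count;
}

my $seq      = 'ACGTACGA';    # made-up sequence
my $monomers = count_monomers($seq);
my $words    = count_words($seq, 2);
print "$_: $monomers->{$_}\n" for sort keys %$monomers;
print "$_: $words->{$_}\n"    for sort keys %$words;
```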

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();
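To see what kind of quantities these methods report, here is a stand-alone sketch that works on plain aligned strings. It checks whether all rows have the same length and computes the percentage of fully conserved columns. This is a simplified measure defined for this example; it is not claimed to reproduce the formulas behind percentage_identity or overall_percentage_identity, and the rows are made up.

```perl
use strict;
use warnings;

my @rows = ('ACGT-A', 'ACGTTA', 'AC-TTA');   # made-up aligned sequences

# True (1) when every row has the same length, 0 otherwise.
sub is_flush {
    my @rows = @_;
    my %lengths = map { (length $_, 1) } @rows;
    return keys(%lengths) <= 1 ? 1 : 0;
}

# Percentage of columns in which all rows carry the same non-gap character.
sub conserved_column_percentage {
    my @rows = @_;
    my $width = length $rows[0];
    return 0 unless $width;
    my $conserved = 0;
    for my $col (0 .. $width - 1) {
        my %seen = map { (substr($_, $col, 1), 1) } @rows;
        my ($only) = keys %seen;
        $conserved++ if keys(%seen) == 1 && $only ne '-';
    }
    return 100 * $conserved / $width;
}

print "same length of all sequences: ", (is_flush(@rows) ? "yes" : "no"), "\n";
printf "fully conserved columns: %.2f %%\n", conserved_column_percentage(@rows);
```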

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ($aln->each_alphabetically()) {
        my $colnr = 0;
        foreach my $chr (split("", $seq->seq())) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr (split "", $col) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ($aln->each_alphabetically()) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
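The column filtering itself does not depend on bioperl, so you can experiment with it on plain strings first. The sketch below does the same job in a more compact way: it selects the indices of the columns to keep and rebuilds each row from them. The function name strip_gap_columns and the rows are made up for this example.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character in any row.
sub strip_gap_columns {
    my ($gapchar, @rows) = @_;
    my $width = length $rows[0];
    # indices of the columns in which no row has a gap
    my @keep = grep {
        my $col = $_;
        !grep { substr($_, $col, 1) eq $gapchar } @rows;
    } 0 .. $width - 1;
    # rebuild each row from the kept columns only
    return map {
        my $row = $_;
        join '', map { substr($row, $_, 1) } @keep;
    } @rows;
}

my @rows = ('AC-GT', 'ACTGT', 'A-TGA');   # made-up aligned sequences
print "$_\n" for strip_gap_columns('-', @rows);
```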

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";

        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of the Bio::Tools::Run::StandAloneBlast factory actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
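The hash trick suggested in Exercise 4.11 can be tried out on plain strings before plugging it into a Blast parser. In the sketch below the hit names are made up and assumed to be listed best-first, and the dbname function is a stand-in written for this example (it takes the text before the first "|"); it is not the course's dbname function.

```perl
use strict;
use warnings;

# Made-up hit names, assumed to be sorted best-first as in a Blast report.
my @hit_names = (
    'sp|Q00001|AAA_TEST',
    'sp|Q00002|BBB_TEST',
    'gb|X00001|',
    'pir|A00001|',
    'gb|X00002|',
);

# Stand-in for a dbname function: the database name is taken to be the
# text before the first "|".
sub dbname {
    my ($name) = @_;
    return (split /\|/, $name)[0];
}

my %done;
my @best;
for my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};     # a better hit for this database was seen
    $done{$database} = 1;         # mark the database as already done
    push @best, $name;
}
print "$_\n" for @best;
```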

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for every HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                     -format => 'fasta');

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'psiblast',
                                -file   => $ARGV[0]);

    while (my $result = $in->next_result()) {
        foreach my $hit ($result->hits) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron: primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr: primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " $a{a} $a{b}\n";

# reference on an anonymous hash
$ra = {a => 1, b => 2 };
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the referenced value (or its class, see below).
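The three uses of references listed above (hash parameter, function parameter, output parameter) can be tried without bioperl. The following is only a sketch: run_search and my_filter are hypothetical names, and the "sequences" are plain strings.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash reference, a code reference (filter)
# and an array reference used as an output parameter.
sub run_search {
    my (%args) = @_;
    my $param  = $args{-run};          # reference to a hash
    my $filter = $args{-filt_func};    # reference to a function
    my $saved  = $args{-save_array};   # reference to an array (output)
    foreach my $seq (@{ $param->{-seqs} }) {
        push @$saved, $seq if $filter->($seq);
    }
    return scalar @$saved;
}

# Keep only strings of at least 4 characters.
sub my_filter { my $seq = shift; return length($seq) >= 4 }

my %runParam = ( -prog => 'blastp',
                 -seqs => [ 'ACGT', 'AC', 'ACGTAC' ] );   # anonymous array
my @kept;
my $count = run_search( -run        => \%runParam,
                        -filt_func  => \&my_filter,
                        -save_array => \@kept );

print "kept $count: @kept\n";
print ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@kept), "\n";
```

Because @kept is passed by reference, the function fills the caller's array directly instead of returning a copy.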

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT filehandles. The filehandle is associated with the stream when the file is opened:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;              # read a line
print OUTFILE $line;           # write a line

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');     # make a tied filehandle
$sequence = <$in>;                           # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                        # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
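As an illustration of passing filehandles around, here is a small sketch in plain Perl. The helper write_records is hypothetical. The sketch also uses lexical filehandles opened on an in-memory string, an idiom that is not used in this course but avoids the typeglob issue altogether.

```perl
use strict;
use warnings;

# Hypothetical helper: write records to whatever filehandle it is given.
sub write_records {
    my ($fh, @records) = @_;
    print $fh "$_\n" for @records;
    return scalar @records;
}

# A bareword filehandle must be passed as a reference to its typeglob.
write_records(\*STDOUT, 'to standard output');

# A lexical filehandle (here opened on an in-memory string) is an
# ordinary scalar and can be passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_records($out, 'seq1', 'seq2');
close($out);

open(my $in, '<', \$buffer) || die "cannot open in-memory file: $!";
my $first = <$in>;        # scalar context: one line
my @rest  = <$in>;        # list context: all remaining lines
close($in);
chomp($first, @rest);
print "first=$first rest=@rest\n";
```

The same function works for both kinds of handle, which is why bioperl constructors can accept a -fh argument.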

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

Bioperl uses perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if ( $@ ) {
    print "Caught exception\n";
} else {
    print "no exception\n";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
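The throw / eval / abstract-method pattern can be reproduced in a few lines of plain Perl. This is a sketch only: SeqInterface, GoodSeq and BadSeq are hypothetical classes, and the throw method below is a simplified stand-in for the bioperl one (no stack trace).

```perl
use strict;
use warnings;

package SeqInterface;    # hypothetical interface class
sub new { my $class = shift; return bless {}, $class }
sub throw {
    my ($self, $string) = @_;
    die "-------------------- EXCEPTION --------------------\n"
      . "MSG: $string\n"
      . "---------------------------------------------------\n";
}
# "Abstract" method: an implementing class must override it.
sub seq {
    my ($self) = @_;
    $self->throw(ref($self) . " did not provide a seq method");
}

package GoodSeq;         # hypothetical implementing class
our @ISA = ('SeqInterface');
sub seq { return 'ACGT' }

package BadSeq;          # forgets to implement seq
our @ISA = ('SeqInterface');

package main;

# Call seq inside an eval block and report whether an exception was raised.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };    # note the ; closing the eval statement
    return $@ ? "caught" : "ok: $value";
}

print try_seq(GoodSeq->new), "\n";
print try_seq(BadSeq->new), "\n";
```

The error only appears when the missing method is actually called, which is the price of encoding abstract methods this way.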

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
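A self-contained way to experiment with getopts is to wrap it in a function that works on a local copy of @ARGV. In this sketch, parse_options and the option letters are illustrative choices, not part of the course scripts.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse an argument list and apply defaults.
sub parse_options {
    local @ARGV = @_;             # getopts works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or return;
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
        rest    => [@ARGV],       # arguments left after the options
    };
}

my $o = parse_options('-i', 'EMBL', 'seqs.dat');
print "in=$o->{inform} out=$o->{outform} files=@{ $o->{rest} }\n";
```

A letter followed by a colon (i:, o:) takes a value, while a bare letter (h) is a simple flag.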

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component.

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
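These class operations (inheritance through @ISA, isa tests, ref) can be tried on a toy hierarchy. Feature and Exon below are hypothetical classes written for this sketch; they are not the bioperl ones.

```perl
use strict;
use warnings;

package Feature;                  # hypothetical base class
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub primary_tag { return $_[0]{tag} }

package Exon;                     # hypothetical subclass
our @ISA = ('Feature');           # inheritance
sub is_coding { return 1 }

package main;

my $exon = Exon->new(tag => 'exon');

print "class of exon: ", ref($exon), "\n";
print "exon is a Feature\n"       if $exon->isa('Feature');
print "Exon class is a Feature\n" if Exon->isa('Feature');
print "inherited method: ", $exon->primary_tag, "\n";
```

Note that new is inherited from Feature but blesses the object into the class it was called on, so ref($exon) reports Exon.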

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy:

BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
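The order of execution can be observed with a tiny script. This is a sketch in plain Perl; the value 'widest' is a made-up stand-in for the policy object.

```perl
use strict;
use warnings;

our @order;
our $coord_policy;

push @order, "run time, policy=$coord_policy";

# Although written last, this block runs first, while the file is compiled.
BEGIN {
    $coord_policy = 'widest';      # made-up default value
    push @order, 'BEGIN block';
}

print "$_\n" for @order;
```

The run-time statement already sees the value set in the BEGIN block, which is what makes such a block suitable for module-wide defaults.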

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods.
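The three Exporter variables can be seen at work in a single file. In this sketch the module My::SeqUtil, its functions and its :obj tag are all hypothetical. The module is defined inline in a BEGIN block (and registered in %INC) only so that the example is self-contained; normally it would live in its own .pm file.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtil;                       # hypothetical module
    use Exporter 'import';
    our @EXPORT      = qw(revcomp);            # exported by default
    our @EXPORT_OK   = qw(gc_count $Default);  # imported on demand
    our %EXPORT_TAGS = ( obj => [qw($Default)] );
    our $Default     = 'ACGT';

    # Reverse complement of a DNA string.
    sub revcomp {
        my $s = reverse shift;
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }
    # Number of G and C characters.
    sub gc_count { my $s = shift; return ($s =~ tr/GCgc//) }

    $INC{'My/SeqUtil.pm'} = 1;                 # pretend the file was loaded
}

use My::SeqUtil;                    # imports revcomp
use My::SeqUtil qw(gc_count :obj);  # imports gc_count and $Default

print revcomp('AACG'), " ", gc_count('GGCA'), " $Default\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is declared in both places.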

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;             # make a tied filehandle
$sequence = <$fh>;          # read a sequence object (calls READLINE)
print $fh $sequence;        # write a sequence object (calls PRINT)
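The mechanism can be reproduced without bioperl. In the sketch below, RecordStream and RecordStream::Fh are hypothetical classes that read from and write to in-memory arrays. They mimic the structure described above but are not the bioperl implementation.

```perl
use strict;
use warnings;
use Symbol;

package RecordStream;        # hypothetical stream of records
sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my $self = shift; return shift @{ $self->{in} } }
sub write_rec { my $self = shift; push @{ $self->{out} }, @_; return 1 }
# Build a filehandle tied to the helper class below.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym;
    tie *$s, 'RecordStream::Fh', $self;
    return $s;
}

package RecordStream::Fh;    # class behind the tied filehandle
sub TIEHANDLE {
    my ($class, $stream) = @_;
    return bless { stream => $stream }, $class;
}
sub READLINE {
    my $self = shift;
    return $self->{stream}->next_rec unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_rec);
    return @list;
}
sub PRINT { my $self = shift; return $self->{stream}->write_rec(@_) }

package main;

my $stream = RecordStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;          # scalar context: calls READLINE once
my @rest   = <$fh>;          # list context: READLINE returns all remaining records
print $fh $first;            # calls PRINT
print STDOUT "first=$first rest=@rest written=@{ $stream->{out} }\n";
```

The caller only sees ordinary filehandle syntax; the class decides what "reading a line" and "printing" mean.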

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # default output name: input name with the output format as extension
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                              -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift(@ARGV) || 'swiss';
my $outform = shift(@ARGV) || 'fasta';

my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                              -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                              -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

# apply the default before lc() to avoid a warning on an undefined option
my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                              -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0;   # false
    }
    return 1;       # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default: 'swiss')\n";
    print "  -o outformat -- outfile format (default: 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines above)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog'     => 'blastp',
    '-data'     => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while ( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while ( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while ( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while ( my $hit = $result->next_hit()) {
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while ( my $hit = $result->next_hit()) {
    while ( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    my $i = 0;
    while ( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    # keep only the hits of the current iteration for the next comparison
    @previous_hitarray = ();
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;     # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot => 'swiss',
                           sptrnrdb => 'swiss',
                           gb => 'genbank',
                           embl => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ''; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;   # a module file should end with a true value


Page 21: Bio Perl

Chapter 2 Sequences

Web form [http://bioweb.pasteur.fr/seqanal/interfaces/extractcds.html]

Tip

Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.

Tip

Figure 2.6 shows the relation between features in an EMBL entry and the Bio::SeqFeature class.

Exercise 2.7. extractcds, version 2

Compare Exercise 2.6 with this version [lecture_code/extractcds2].

2.3.3 Tag system

This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#qualifiers]).

In the following example, CDS (the feature key) is a primary tag and codon is a secondary tag:

    CDS             1..1000
                    /codon=(seq:"cug",aa:Ser)
                    /codon=(seq:"tga",aa:Trp)

The tag system also enables you to create new qualifier types. Examples in this tutorial:

• Blast hits as features (see Figure 4.1 and Exercise 4.16)
• Creation of a database from a Genscan parsing (Exercise 5.1)

See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.


Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations you can use: fuzzy locations, ranges, and lists to specify the beginning or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start => '<30',
                                                 -end => 90,
                                                 -loc_type => '.');

which specifies a location whose beginning is before 30 and whose end is at 90. Coding:

• '..' EXACT
• '^' BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177
• '.' WITHIN (a base within a range). Example: (102.110) is a site within the 102..110 range, inclusive
• '<' BEFORE
• '>' AFTER
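As a rough illustration of the coding above, the following pure-Perl sketch classifies a single position string by the symbol it contains. The helper name position_type is hypothetical and is not part of bioperl; the real parsing is done inside Bio::Location::Fuzzy, and this sketch only covers the simple forms listed above.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl function): return the coding name
# for one start or end position written in feature-table notation.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;            # e.g. <30
    return 'AFTER'   if $pos =~ /^>\d+$/;            # e.g. >177
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;        # e.g. 123^124
    return 'WITHIN'  if $pos =~ /^\(?\d+\.\d+\)?$/;  # e.g. (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;             # e.g. 90
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

With bioperl itself you would not write such a function: the strings are passed directly as -start and -end, as in the Bio::Location::Fuzzy example.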

Location Classes

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
  • interface: Bio::Location::CoordinatePolicyI
  • default: Bio::Location::WidestCoordPolicy
  • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy


Figure 2.7. Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.


Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;              # sequence as string
    $seqobj->subseq(10,40);    # sub-sequence as string
    $seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;
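The coordinates given to subseq and trunc are 1-based and include both ends, unlike Perl's own substr, which counts from 0 and takes a length. The sketch below shows the conversion on a plain string; subseq_1based is a hypothetical helper written for this illustration, not a bioperl method.

```perl
use strict;
use warnings;

# Hypothetical helper mimicking the coordinate convention of subseq:
# positions are 1-based and both ends are included.
sub subseq_1based {
    my ($string, $start, $end) = @_;
    return substr($string, $start - 1, $end - $start + 1);
}

my $dna = 'ATGGCCTTTAAGTGA';
print subseq_1based($dna, 1, 3), "\n";     # first codon
print subseq_1based($dna, 13, 15), "\n";   # last codon
```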

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ =>$pat, -TYPE =>'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>'EcoRV--GAT^ATC',   # creates a new one
                                              -MAKE =>'custom');
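To make the EcoRV--GAT^ATC notation concrete: the caret marks where the enzyme cuts inside its recognition site. Below is a minimal pure-Perl sketch of that idea on a plain string. The helper cut_at_site is hypothetical, handles only one strand and unambiguous sites, and is not how Bio::Tools::RestrictionEnzyme is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $seq at every occurrence of a site written
# with a caret, e.g. 'GAT^ATC' (cut between GAT and ATC).
sub cut_at_site {
    my ($seq, $site) = @_;
    my ($left, $right) = split /\^/, $site;
    my @fragments;
    my $from = 0;   # start of the fragment being built
    my $pos  = 0;   # where to resume the search
    while ( ($pos = index($seq, $left . $right, $pos)) >= 0 ) {
        my $cut = $pos + length $left;          # offset of the cut point
        push @fragments, substr($seq, $from, $cut - $from);
        $from = $cut;
        $pos++;                                 # look for the next site
    }
    push @fragments, substr($seq, $from);       # trailing fragment
    return @fragments;
}

my @fragments = cut_at_site('AAGATATCCCGATATCTT', 'GAT^ATC');
print join("\n", @fragments), "\n";
```

Joining the fragments back together gives the original sequence, which is a convenient sanity check when experimenting with other sites.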

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave ('-file'=>'sigtest.aa',   # not a Bio::Seq object
                                                   '-threshold'=>'3.5',
                                                   '-desc'=>'test sigcleave protein seq',
                                                   '-type'=>'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::AlignIO->newFh ( -fh => \*STDIN, -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;


AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();


SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all
    # columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split(//, $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split //, $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
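The same idea (turn rows into columns, filter, turn columns back into rows) can be tried without bioperl on plain strings. This sketch assumes that all rows have the same length and that '-' is the gap character; remove_gap_columns is a hypothetical helper name used only here.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column that contains a gap in any row.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # rows -> columns
    my @cols;
    for my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ for split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $rownr = 0;
        $out[$rownr++] .= $_ for split //, $col;
    }
    return @out;
}

print "$_\n" for remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCG-');
```

In the bioperl version the rows come from $seq->seq() and the filtered strings are written back with $seq->seq($new_string), as in Example 3.3.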

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9


Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
  • print its name and database
  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
    • print some hit statistics


Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp',
                                                         'database' => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14


Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
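The %done hint relies on hits being read in report order, where the best-scoring hits normally come first, so the first hit seen for a database is kept as its best one. Here is a minimal pure-Perl sketch on hit names only. The hit names are made up, and db_of is a hypothetical stand-in for the dbname function mentioned above that assumes names of the form db|accession|id.

```perl
use strict;
use warnings;

# Hypothetical stand-in for dbname(): assumes 'db|accession|id' names.
sub db_of {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Made-up hit names, in report order (best hit first).
my @hit_names = ('sp|P00001|AAA_X', 'pir|S0001|S0001',
                 'sp|P00002|BBB_X', 'pir|S0002|S0002');

my %done;   # databases for which a hit was already kept
my @best;
for my $name (@hit_names) {
    my $database = db_of($name);
    next if $done{$database};   # an earlier (better) hit was already kept
    $done{$database} = 1;
    push @best, $name;
}
print "$_\n" for @best;
```

In the real exercise the loop would run over $result->next_hit() and use $hit->name() instead of a fixed list.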

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome
• You may first display this as a Clustalw alignment (one sequence per HSP)
• Blast report [data/blast-est.out]

Solution A.20


Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh=>\*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration
• the hits that have disappeared

Solution A.25
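Both questions in this exercise are set differences between the hit lists of two consecutive iterations. The sketch below shows that logic with plain arrays of hit names. The names are invented and difference is a helper written for this illustration; with bioperl, the lists would come from the PSI-blast report.

```perl
use strict;
use warnings;

# Return the elements of @$a_ref that are not in @$b_ref.
sub difference {
    my ($a_ref, $b_ref) = @_;
    my %in_b = map { $_ => 1 } @$b_ref;
    return grep { !$in_b{$_} } @$a_ref;
}

my @previous = qw(HIT_A HIT_B HIT_C);   # hits of iteration n-1
my @current  = qw(HIT_B HIT_C HIT_D);   # hits of iteration n

my @new         = difference(\@current,  \@previous);
my @disappeared = difference(\@previous, \@current);

print "NEW: @new\n";
print "DISAPPEARED: @disappeared\n";
```

Using a hash for the lookup avoids rescanning the second list for every hit, which matters once the reports contain many hits.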

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq => $hsp->qs,
            -id => $query1->display_id,
            -start => 1,
            -end => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq => $hsp->ss,
            -id => $report->sbjctName,
            -start => 1,
            -end => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS :\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);


Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format
  • containing an entry for each Genscan predicted CDS
  • put the exons of this CDS as Features in the current entry
• (b) index this EMBL formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27


Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS
• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

52 Accessing a local database with golden

Exercise 53 Build a small bioperl module (for the golden program)

Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this

48

Chapter 5 Databases

use LocalDBAccessuse BioSeqIO

my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquofastarsquo )

print $out $seq

Solution A.29

Chapter 6 Perl Reminders

These reminders may help you to use the modules [use] or to reach a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ],     # Bio::Seq.pm objects
                );

which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs,       # Bio::Seq.pm objects
                );

with a variable.

• references on anonymous data structures:

    # (reminder) a plain, named hash
    %a = ( a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced value (or its class, see below).
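The following is a minimal sketch, in core Perl only, of the three uses of references listed in this section: a hash reference as input, a code reference as a filter, and an array reference as an output parameter. The helper run_with_params and its option names are hypothetical and are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl function): takes a hash reference of
# options, a code reference used as a filter, and an array reference that
# receives the results (an "output parameter").
sub run_with_params {
    my ($params, $filter, $results) = @_;
    foreach my $seq (@{ $params->{-seqs} }) {
        push @$results, $seq if $filter->($seq);
    }
    return scalar @$results;
}

# Keep only strings of at least four characters.
sub my_filter { return length($_[0]) >= 4 }

my @seqs     = ('ACGT', 'AC', 'GGCCA');
my %runParam = ( -method => 'local', -seqs => \@seqs );
my @kept;

my $count = run_with_params(\%runParam, \&my_filter, \@kept);

print "kept $count: @kept\n";
print ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@kept), "\n";
```

The caller's @kept array is filled in place because the function only ever sees a reference to it.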

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened.

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;         # read a line
    print OUTFILE $line;      # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
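As an illustration of passing filehandles around, here is a small sketch using only core Perl. The write_fasta helper is hypothetical. It shows a bareword handle passed as a typeglob reference and, as an alternative that the course does not use, a lexical filehandle opened on an in-memory string.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA record to whatever filehandle
# it is given.
sub write_fasta {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A reference to the typeglob lets a bareword handle travel as a parameter.
write_fasta(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle (here opened on an in-memory string) is already a
# scalar and can be passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_fasta($out, 'seq2', 'TTGA');
close($out);

open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
my $first_line = <$in>;     # reads one line, newline included
close($in);
chomp $first_line;
print "first line read back: $first_line\n";
```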

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

Bioperl indeed uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
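The two ideas of this section, catching an exception with eval and encoding an abstract method with die, can be sketched without bioperl. The classes My::SeqI, My::Seq and My::Broken below are hypothetical stand-ins, and plain die is used where bioperl would call throw.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and dies unless a
# subclass overrides it.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub seq {
    my ($self) = @_;
    die ref($self) . " did not implement seq()\n";
}

# Hypothetical concrete class that provides seq().
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq} }

# Hypothetical class that forgets to implement seq().
package My::Broken;
our @ISA = ('My::SeqI');

package main;

my @messages;
foreach my $obj (My::Seq->new(seq => 'ACGT'), My::Broken->new()) {
    my $value = eval { $obj->seq() };   # the block may die
    if ($@) {
        chomp(my $err = $@);
        push @messages, "caught: $err";
    } else {
        push @messages, "ok: $value";
    }
}
print "$_\n" for @messages;
```

The program keeps running after the failed call because the die happened inside the eval block.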


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
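To see what getopts leaves behind, the sketch below simulates a command line by assigning to @ARGV (an assumption made only so that the example runs without arguments). The file names are hypothetical. Options followed by a colon in the specification take a value.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line so the sketch runs without arguments.
@ARGV = ('-i', 'embl', '-I', 'seqs.embl', 'extra.txt');

my %opts;
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file]\n";

my $inform  = $opts{i} || 'swiss';    # given on the command line
my $outform = $opts{o} || 'fasta';    # falls back to the default
my $infile  = $opts{I} || 'infile';

# getopts removes the options it recognized from @ARGV.
print "in: $infile ($inform), out format: $outform, remaining: @ARGV\n";
```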

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { }

    if ($seq->isa("Bio::Seq::RichSeq")) { }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
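A short sketch of these tests on a hypothetical two-level hierarchy (My::Seq and My::RichSeq stand in for Bio::Seq and Bio::Seq::RichSeq, and are not bioperl classes):

```perl
use strict;
use warnings;

# Hypothetical base class.
package My::Seq;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub id  { return $_[0]->{id} }

# Hypothetical subclass: inherits new() and id() through @ISA.
package My::RichSeq;
our @ISA = ('My::Seq');
sub division { return $_[0]->{division} }

package main;

my $seq = My::RichSeq->new(id => 'TAUD_ECOLI', division => 'PRO');

print "class: ", ref($seq), "\n";
print "object is a My::Seq\n" if $seq->isa('My::Seq');
print "class is a My::Seq\n"  if My::RichSeq->isa('My::Seq');
print "inherited id: ", $seq->id, "\n";
print "can division\n"        if $seq->can('division');
```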

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
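The following sketch only illustrates the timing of a BEGIN block; the @order variable is hypothetical. The BEGIN block is executed while the file is being compiled, so its effect is visible before any ordinary statement runs, even one that appears earlier in the file.

```perl
use strict;
use warnings;

our @order;

# Ordinary statement: executed at run time.
push @order, 'run time';

# Executed during compilation, before the statement above,
# although it appears later in the file.
BEGIN { push @order, 'BEGIN' }

print join(' -> ', @order), "\n";
```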

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods.
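Here is a self-contained sketch of the three Exporter variables. The module My::Tools, its functions and its stats tag are hypothetical. Because the module is defined in the same file, its export lists are set in a BEGIN block and import is called by hand; for a module stored in its own file, a plain use My::Tools qw(:DEFAULT :stats) would do the same.

```perl
use strict;
use warnings;

# Hypothetical module defined inline; normally it would live in My/Tools.pm.
package My::Tools;
use Exporter 'import';      # borrow Exporter's import method
our (@EXPORT, @EXPORT_OK, %EXPORT_TAGS);
BEGIN {
    @EXPORT      = ('revcomp');                  # exported by default
    @EXPORT_OK   = ('gc_count');                 # imported on demand
    %EXPORT_TAGS = ( stats => ['gc_count'] );    # a named set of symbols
}

# Reverse complement of a DNA string.
sub revcomp {
    my $s = scalar reverse($_[0]);
    $s =~ tr/ACGTacgt/TGCAtgca/;
    return $s;
}

# Number of G and C characters in a string.
sub gc_count {
    my $s = shift;
    return ($s =~ tr/GCgc//);
}

package main;

# Import the default symbols plus those tagged "stats".
BEGIN { My::Tools->import(qw(:DEFAULT :stats)) }

print revcomp('AACG'), " ", gc_count('AACG'), "\n";
```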

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
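The mechanism can be tried without bioperl. In the sketch below, My::RecordIO is a hypothetical stand-in for Bio::SeqIO: it reads "records" from an array and collects written records in another one. Only TIEHANDLE, READLINE and PRINT are the method names that Perl itself requires for tied filehandles; everything else is an assumption made for illustration.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical stand-in for Bio::SeqIO.
package My::RecordIO;

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my $self = shift; return shift @{ $self->{in} } }
sub write_rec { my $self = shift; push @{ $self->{out} }, @_; return 1 }

# Build a filehandle tied to this class and wrapping the object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie: the tied object simply remembers the wrapped object.
sub TIEHANDLE { my ($class, $io) = @_; return bless { io => $io }, $class }

# Called by <$fh>: one record in scalar context, all of them in list context.
sub READLINE {
    my $self = shift;
    return $self->{io}->next_rec() unless wantarray;
    my (@list, $rec);
    push @list, $rec while defined( $rec = $self->{io}->next_rec() );
    return @list;
}

# Called by print $fh ...
sub PRINT { my $self = shift; return $self->{io}->write_rec(@_) }

package main;

my $io = My::RecordIO->new('rec1', 'rec2', 'rec3');
my $fh = $io->fh;

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT

print "first=$first rest=@rest written=@{ $io->{out} }\n";
```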


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= "." . $outform;
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = (shift @ARGV) || 'swiss';
    my $outform = (shift @ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {

        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = Bio::AlignIO->new( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = Bio::AlignIO->new( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
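The gap-counting step of this solution can be checked on plain strings, without Bio::AlignIO. The rows below are hypothetical toy data, and substr plays the role that slice plays on a real alignment object.

```perl
use strict;
use warnings;

# Toy alignment rows (hypothetical data), all of the same length.
my @rows = ( '--ACGTAC--',
             '-TACGTACG-',
             '---CGTACGT' );
my $gap_char = '-';

# Longest run of leading gaps and longest run of trailing gaps over all rows.
my ($startgaps, $endgaps) = (0, 0);
foreach my $str (@rows) {
    my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
    my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns that lie after every leading gap run and before
# every trailing gap run.
my @trimmed = map { substr($_, $startgaps, length($_) - $startgaps - $endgaps) } @rows;

print "start=$startgaps end=$endgaps\n";
print "$_\n" for @trimmed;
```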

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new( -file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss' );
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = Bio::DB::SwissProt->new;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast" );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-fh'     => \*STDIN );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # hits shorter than this length are skipped
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank name, i.e. the word before the first "|" of a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;

    my %done;
    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                   -format => 'fasta' );
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new( -file   => $ARGV[1],
                                   -format => 'fasta' );
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = Bio::SearchIO->new( '-format' => 'blast',
                                           '-file'   => $ARGV[1] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
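The padding done by fill_gaps in this solution can be tried on its own. The sketch below is a variant written with the repetition operator x instead of two loops; the guard against a negative right padding is an addition that is not in the solution, and the input values are hypothetical.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based) of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;             # never pad by a negative amount
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 4, 12);
print "$row\n";
print length($row), "\n";
```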

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new( -file => $seqfile, -format => 'fasta' );
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(
                          database => $db,
                          program  => 'blastp' );

        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, '-format' => 'swiss' );
    print $out $query;


Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'  => 'blastp',
                      'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = Bio::Tools::BPlite->new( -file => $ARGV[0] );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = Bio::Tools::BPlite->new( -fh => \*STDIN );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = Bio::Tools::BPpsilite->new( '-file' => $ARGV[0] );

    my @previous_hitarray;
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );

    my $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::Seq::RichSeq;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

    # date of construction
    my $date = gmtime;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1 );

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::SeqFeature::Generic;
    use Bio::Location::Split;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new( -fh     => \*STDIN,
                                 -format => 'fasta' );

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = Bio::Location::Split->new();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # make the exon coordinates relative to the entry subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds ) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new( -primary  => 'CDS',
                                                        -location => $newloc );

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new( -seq     => $seqstr,
                                           -id      => $ids[1],
                                           -moltype => 'dna' );

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db) );

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if ( ! defined $seq );

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 22: Bio Perl

Chapter 2 Sequences

Figure 2.6. Correspondence between an EMBL entry and bioperl tags

2.3.4 Location

Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment.

Coordinate types. There are several coordinate types and representations: you can use fuzzy locations, ranges, or lists to specify the begin or the end (see the documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location]). Example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose begin is before 30 and whose end is at 90. Coding:

• '..': EXACT
• '^': BETWEEN (a site between 2 bases). Example: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177
• '.': WITHIN (a base within a range). Example: (102.110) is a site within the 102..110 range, inclusive
• '<': BEFORE
• '>': AFTER
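The coding above can be illustrated with a small plain-Perl sketch. It does not use bioperl: the `position_type` helper is a hypothetical name, and the sketch only classifies an EMBL-style position string, under the assumption that an exact position is a bare number.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): classify one EMBL-style
# position string according to the coding listed above.
sub position_type {
    my ($pos) = @_;
    return 'BEFORE'  if $pos =~ /^<\d+$/;            # <30
    return 'AFTER'   if $pos =~ /^>\d+$/;            # >200
    return 'BETWEEN' if $pos =~ /^\d+\^\d+$/;        # 123^124
    return 'WITHIN'  if $pos =~ /^\(\d+\.\d+\)$/;    # (102.110)
    return 'EXACT'   if $pos =~ /^\d+$/;             # 90
    return 'UNKNOWN';
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>200') {
    printf "%-10s %s\n", $pos, position_type($pos);
}
```

In bioperl itself this work is done by the Bio::Location::Fuzzy class; the sketch is only meant to make the notation concrete.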

Location Classes

• Bio::LocationI
• Bio::Location::Simple
• Bio::Location::SplitLocationI and Bio::Location::Split
• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
• Coordinate policies for fuzzy location start and end determination:
    • interface: Bio::Location::CoordinatePolicyI
    • default: Bio::Location::WidestCoordPolicy
    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy

Figure 2.7. Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line with the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;                # sequence as string
    $seqobj->subseq(10,40);      # sub-sequence as string
    $seqobj->trunc(10,100);      # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AA...TAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');

    %raw_results = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";
    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list
    # containing all columns as strings
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap
    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings
    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
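The column-filtering idea can also be tried without bioperl, on plain strings. The following is a minimal sketch under the assumption that all rows have the same length; `remove_gap_columns` is a hypothetical helper, not a SimpleAlign method.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column in which at least one row
# contains the gap character. Rows are plain strings of equal length.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;
    return () unless @rows;
    my @kept = ('') x @rows;                       # one output string per row
    for my $col (0 .. length($rows[0]) - 1) {
        my @chars = map { substr($_, $col, 1) } @rows;
        next if grep { $_ eq $gapchar } @chars;    # column has a gap: skip it
        $kept[$_] .= $chars[$_] for 0 .. $#rows;
    }
    return @kept;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'ACGGT', 'A-GGT');
print "$_\n" for @filtered;
```

Unlike the example above, this sketch walks the columns directly with substr instead of building an intermediate list of column strings; both approaches keep the same columns.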

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
    • print its name and database
    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
        • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.)
Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by Bio::Tools::Run::StandAloneBlast in the examples above actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
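The hash-marking hint can be sketched in plain Perl, without bioperl. Everything here is an assumption made for illustration: the `dbname` below is a hypothetical stand-in for the helper mentioned above, the hit names are invented, and they are assumed to look like "db|accession|name" and to arrive best-first, as in a Blast report.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the dbname() helper: assumes hit names such as
# "sp|P1|A", where the first '|'-separated field is the database.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Keep the first (best) hit seen for each database.
sub best_hit_by_db {
    my @hit_names = @_;
    my (%done, @best);
    for my $name (@hit_names) {
        my $db = dbname($name);
        next if $done{$db};      # a better hit was already kept for this database
        $done{$db} = 1;
        push @best, $name;
    }
    return @best;
}

print "$_\n" for best_hit_by_db('sp|P1|A', 'pir|X1|B', 'sp|P2|C');
```

With real reports the names would come from $hit->name() inside the next_hit loop; the sketch only shows the bookkeeping.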

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from the HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.
• You may first display this as a Clustalw alignment (one sequence per HSP).
• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                      -format => 'fasta' );

    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration
• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new( -seq   => $hsp->qs,
                                               -id    => $query1->display_id,
                                               -start => 1,
                                               -end   => $query1->length );
        my $sbjctSeq = Bio::LocatableSeq->new( -seq   => $hsp->ss,
                                               -id    => $report->sbjctName,
                                               -start => 1,
                                               -end   => $query2->length );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.
Solution A.26

4.1.6 Blast: internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?
2. Look at the _parse_predictions method in Bio::Tools::Genscan.
3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:
    • containing an entry for each Genscan predicted CDS
    • put the exons of this CDS as Features in the current entry
• (b) index this Embl formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS
• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
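The reference kinds listed in this section can be tried together in a short standalone script. This is only a sketch in core Perl: `select_items`, the `-items` key and the data are made up for the illustration and have nothing to do with the bioperl API.

```perl
use strict;
use warnings;

# Hypothetical function: keep the items accepted by a filter.
# Both the parameter hash and the filter are passed by reference.
sub select_items {
    my ($params, $filter) = @_;
    return grep { $filter->($_) } @{ $params->{-items} };
}

my %params   = ( -items => [ 3, 12, 7, 25 ] );   # reference to an anonymous array
my $is_large = sub { $_[0] > 10 };               # reference to an anonymous function
my @selected = select_items(\%params, $is_large);

print "selected: @selected\n";
print "types: ", join(" ", ref(\%params), ref($is_large), ref($params{-items})), "\n";
```

Passing \%params rather than %params keeps the hash in one piece; passed directly, it would be flattened into the argument list.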

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;              # read a line
    print OUTFILE $sequence;       # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
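Passing a filehandle as a parameter can be tried with core Perl only. In this minimal sketch, `write_line` is a hypothetical helper; the second call uses a lexical filehandle opened on a string (assuming Perl 5.8 or later), which is a common alternative to typeglob references in more recent code.

```perl
use strict;
use warnings;

# Hypothetical helper: write one line to whichever filehandle is passed in.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword filehandle must be passed as a reference to its typeglob.
write_line(\*STDOUT, 'to standard output');

# A lexical filehandle can be passed directly; here it writes into a string.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($out, 'to a string');
close($out);
print $buffer;
```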

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };

    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
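Both mechanisms of this section, raising an exception from an unimplemented interface method and catching it with eval, can be reproduced with core Perl. The sketch below uses hypothetical classes (ShapeI, Square, Blob) and a plain die in place of the bioperl throw method.

```perl
use strict;
use warnings;

package ShapeI;                 # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class; }
sub area {                      # "abstract" method: raises an exception
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide area()\n";
}

package Square;                 # implements the interface
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;                   # forgets to implement area()
our @ISA = ('ShapeI');

package main;

print "Square area: ", Square->new(side => 3)->area, "\n";

# eval catches the exception; note the ';' after the closing brace
my $ok = eval { Blob->new->area; 1 };
print "caught: $@" unless $ok;
```

Testing the value returned by eval, rather than $@ alone, is a common defensive habit; the form shown earlier in this section works as well for simple cases.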

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');             # -f takes a value
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
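The first example can be wrapped in a function to make it easy to try with different argument lists. This is a sketch only: `parse_options` and the keys of the hash it returns are hypothetical names, and the defaults are the ones used above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse -i/-o (formats) and -I/-O (files)
# from an argument list and apply the defaults.
sub parse_options {
    local @ARGV = @_;                        # getopts() reads @ARGV
    my %opts;
    getopts('i:o:I:O:', \%opts)
        or die "usage: [-i format] [-o format] [-I file] [-O file]\n";
    return {
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
    };
}

my $conf = parse_options('-i', 'embl', '-O', 'result.fasta');
print "$_ = $conf->{$_}\n" for sort keys %$conf;
```

A colon after a letter in the option string means that the option takes a value; a letter without a colon is a simple flag.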

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
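The inheritance and is-a tests listed above can be tried on a toy hierarchy. The My::RootI, My::Seq and My::RichSeq classes below are hypothetical stand-ins, not bioperl classes; @ISA, isa and ref behave the same way on the real ones.

```perl
use strict;
use warnings;

package My::RootI;
sub new { my ($class) = @_; return bless {}, $class; }

package My::Seq;
our @ISA = ('My::RootI');        # My::Seq inherits new() from My::RootI

package My::RichSeq;
our @ISA = ('My::Seq');

package main;

my $seq = My::RichSeq->new();

my $class_is_a  = My::Seq->isa('My::RootI') ? 1 : 0;   # test on a class name
my $object_is_a = $seq->isa('My::Seq')      ? 1 : 0;   # test on an object
my $not_is_a    = My::RootI->isa('My::Seq') ? 1 : 0;   # inheritance is one-way

print "class of \$seq: ", ref($seq), "\n";
print "isa results: $class_is_a $object_is_a $not_is_a\n";
```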

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
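A minimal sketch of the timing of a BEGIN block, assuming a hypothetical My::Location package with a string in place of the coordinate policy object: the block runs while the file is being compiled, before any ordinary statement, even one written above it.

```perl
use strict;
use warnings;

package My::Location;

our ($coord_policy, @load_order);

push @load_order, 'run time';        # ordinary statement: executed after compilation

BEGIN {
    # Executed as soon as the block has been compiled.
    $coord_policy = 'widest';
    push @load_order, 'BEGIN';
}

sub new    { my ($class) = @_; return bless { policy => $coord_policy }, $class; }
sub policy { return $_[0]{policy} }

package main;

my $loc = My::Location->new();
print "default policy: ", $loc->policy(), "\n";
print "order: ", join(', ', @My::Location::load_order), "\n";
```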

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl)

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
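The tag mechanism can be sketched in a single file. My::Blast below is a hypothetical module, not Bio::Tools::Blast; because it is defined inline rather than in its own .pm file, the import call is written out inside a BEGIN block, which is roughly what a use statement does once the module has been loaded.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the example fits in one file.
package My::Blast;
use Exporter;
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Blast);
BEGIN {
    @ISA         = ('Exporter');
    @EXPORT      = ();                        # nothing exported by default
    @EXPORT_OK   = qw($Blast version);        # available on demand
    %EXPORT_TAGS = ( obj => [qw($Blast)] );   # the :obj tag selects $Blast only
    $Blast       = bless { name => 'static object' }, __PACKAGE__;
}
sub name    { return $_[0]{name} }
sub version { return '0.1' }

package main;

# Stands in for "use My::Blast qw(:obj);" when the module lives in its own file.
BEGIN { My::Blast->import(':obj') }

print $Blast->name(), "\n";                   # $Blast is now visible in main
print "version() imported: ", (defined &main::version ? "yes" : "no"), "\n";
```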

6.3.2 Compiler instructions

• use strict;: no symbolic references, no access to undeclared non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables
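A short sketch of both pragmas at work. The contents given to %valid_type here are an assumption for illustration; the point is that the pre-declared global can be used under strict, while a symbolic reference (a string used as a variable name) is refused at run time.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);    # pre-declared package globals

%valid_type = map { $_ => 1 } qw(dna rna protein);

# Under "use strict", a string cannot be dereferenced as a variable name.
my $name  = 'valid_type';
my $ok    = eval { my %copy = %$name; 1 };
my $error = $ok ? '' : $@;

print "symbolic reference refused\n" if $error =~ /strict refs/;
print join(',', sort keys %valid_type), "\n";
```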

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
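The same mechanism can be tried without bioperl. In the sketch below, the hypothetical class My::ListFh serves items from an in-memory list where Bio::SeqIO would serve sequence objects; TIEHANDLE, READLINE and PRINT are the standard hook names of the Perl tie interface.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical tied-handle class: reads come from a list, writes go to a list.
package My::ListFh;

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;   # scalar context: one item
    my @all = @{ $self->{in} };                        # list context: all remaining
    @{ $self->{in} } = ();
    return @all;
}

sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();                            # anonymous glob, as in fh()
my $obj = tie *$fh, 'My::ListFh', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;                                     # calls READLINE
my @rest  = <$fh>;                                     # calls READLINE in list context
print $fh "written through the handle";               # calls PRINT

print "first=$first rest=@rest stored=$obj->{out}[0]\n";
```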


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
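The column arithmetic of this solution can be checked on plain strings, without bioperl. The function name ungapped_window and the three sample rows are made up for illustration; the returned pair corresponds to the two arguments passed to slice above.

```perl
use strict;
use warnings;

# Return the 1-based (start, end) columns left after removing the longest run
# of leading gaps and the longest run of trailing gaps found in any row.
sub ungapped_window {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC', 'TTACGT--', '-TACGTA-');
my ($from, $to) = ungapped_window('-', @rows);
my @trimmed = map { substr($_, $from - 1, $to - $from + 1) } @rows;

print "columns $from..$to: @trimmed\n";
```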

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank name: the word in front of the first '|' of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;    # databanks for which a hit has already been displayed

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
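The "first hit per databank" selection of this solution does not depend on bioperl and can be sketched on plain strings. The hit names below are invented, and the sketch assumes names of the form databank|accession|identifier, sorted best hit first as in a Blast report.

```perl
use strict;
use warnings;

# Databank prefix of a hit name; empty string when there is no '|' separator.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Invented hit names, assumed to be ordered from best to worst hit.
my @hits = ('sp|P12345|A_HUMAN', 'sp|Q00001|B_MOUSE', 'pdb|1ABC|A',
            'gb|X00001|HSXYZ',   'pdb|2DEF|B');

my %done;
my @best = grep { !$done{ dbname($_) }++ } @hits;   # keep the first hit of each databank

print "$_\n" for @best;
```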

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
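The padding done by fill_gaps can be tested on its own. The variant below is a sketch that uses the repetition operator x in place of the two loops; the guards against negative counts are an addition that the course code does not have.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $before = $start - 1;
    $before = 0 if $before < 0;
    my $after = $length - $before - length($seq);
    $after = 0 if $after < 0;        # sequence reaches past the end: no trailing gaps
    return ('-' x $before) . $seq . ('-' x $after);
}

my $row = fill_gaps('ACG', 3, 8);
print "$row (", length($row), " columns)\n";
```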

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in earlier iterations

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction:

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;      # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation
        # add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

  You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index:

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

  You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index:

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

  You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

  since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates to the subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databanks and the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 23: Bio Perl

Chapter 2 Sequences

[http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#location] ), example taken from Bio::Location::Fuzzy [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Fuzzy.html]:

    my $fuzzylocation = new Bio::Location::Fuzzy(-start    => '<30',
                                                 -end      => 90,
                                                 -loc_type => '.');

which specifies a location whose beginning is before 30 and whose end is at 90. Coding:

• '..': EXACT

• '^': BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases somewhere between bases 145 and 177

• '.': WITHIN (a base within a range). Example: (102.110) is a site within the 102..110 range, inclusive

• '<': BEFORE

• '>': AFTER
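To make the notation concrete, here is a small core-Perl sketch that classifies one position written in the feature-table style used in the examples above. The function classify_position is hypothetical and is not how Bio::Location::Fuzzy parses its arguments; it only illustrates which spelling maps to which position type.

```perl
use strict;
use warnings;

# Classify a single position string; returns (type, lower bound, upper bound).
sub classify_position {
    my ($pos) = @_;
    return ('BEFORE',  $1, $1) if $pos =~ /^<(\d+)$/;               # <30
    return ('AFTER',   $1, $1) if $pos =~ /^>(\d+)$/;               # >90
    return ('BETWEEN', $1, $2) if $pos =~ /^(\d+)\^(\d+)$/;         # 123^124
    return ('WITHIN',  $1, $2) if $pos =~ /^\(?(\d+)\.(\d+)\)?$/;   # (102.110)
    return ('EXACT',   $1, $1) if $pos =~ /^(\d+)$/;                # 90
    die "unrecognised position '$pos'\n";
}

for my $pos ('<30', '90', '123^124', '(102.110)', '>177') {
    my ($type, $min, $max) = classify_position($pos);
    print "$pos\t$type\t$min\t$max\n";
}
```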

Location Classes

• Bio::LocationI

• Bio::Location::Simple

• Bio::Location::SplitLocationI and Bio::Location::Split

• Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy

• Coordinate policies, for fuzzy location start and end determination:

    • interface: Bio::Location::CoordinatePolicyI

    • default: Bio::Location::WidestCoordPolicy

    • Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy


Figure 2.7. Location Classes Structure

2.3.5 Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. Then you can add annotations line by line, using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.


Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the script that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules/].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4 Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

    $seqobj->seq;               # sequence as string
    $seqobj->subseq(10,40);     # sub-sequence as string
    $seqobj->trunc(10,100);     # sub-sequence as fresh Bio::PrimarySeq
    $seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                             -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave('-file'      => 'sigtest.aa',   # not a Bio::Seq object
                                                  '-threshold' => '3.5',
                                                  '-desc'      => 'test sigcleave protein seq',
                                                  '-type'      => 'AMINO');

    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;
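To make the sequence calls above more concrete, here is a minimal sketch in plain Perl, without bioperl, of what a 1-based inclusive sub-sequence and a monomer count amount to. The helper name subseq, the variable names and the sample sequence are illustrative assumptions; the real Bio::Seq and Bio::Tools::SeqStats methods do more checking than this.

```perl
use strict;
use warnings;

my $dna = 'ATGGCCATTGTAATGGGCCGC';   # arbitrary illustrative sequence

# Hypothetical helper: 1-based, inclusive sub-sequence (start, end).
sub subseq {
    my ($s, $start, $end) = @_;
    return substr($s, $start - 1, $end - $start + 1);
}

# Monomer counts: one hash entry per letter found in the sequence.
my %count;
$count{$_}++ for split //, $dna;

my $sub = subseq($dna, 4, 9);
print "subseq(4,9) = $sub\n";
print join(' ', map { "$_=$count{$_}" } sort keys %count), "\n";
```

The hash of counts mirrors the kind of table a monomer-counting method would hand back; with bioperl you would call the methods listed above on a sequence object instead.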


Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows.

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
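The transpose, filter, transpose-back idea of this example can be tried without bioperl. The following is only a sketch on plain strings, assuming all rows have the same length; strip_gap_columns is a hypothetical helper name, not a SimpleAlign method.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column containing $gap
# from a list of equal-length aligned strings.
sub strip_gap_columns {
    my ($gap, @rows) = @_;

    # rows -> columns
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ for split //, $row;
    }

    # keep only the gap-free columns
    my @kept = grep { index($_, $gap) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ for split //, $col;
    }
    return @out;
}

my @stripped = strip_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
print "$_\n" for @stripped;
```

In the bioperl version the rows come from $seq->seq() and are written back with $seq->seq($new_string), as in the example above.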

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
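The "mark as done" hash idiom from the hint can be sketched on its own, without bioperl. Everything here is an assumption made for illustration: the hit names are made up, they are assumed to be listed best hit first, and dbname below is a hypothetical stand-in for the course's helper that simply takes the first '|'-separated field.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the course's dbname() helper: assume the
# database is the first '|'-separated field of the hit name.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Made-up hit names, assumed to be sorted best hit first.
my @hit_names = ('sp|ID_A', 'sp|ID_B', 'pir|ID_C', 'sp|ID_D', 'pir|ID_E');

my %done;
my @best;
for my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};    # a better hit was already kept for this database
    $done{$database} = 1;
    push @best, $name;
}
print "$_\n" for @best;
```

In the exercise itself the names would come from $hit->name() while iterating over the hits of the report.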

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract as a string the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for every HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, at which class hierarchy level?


Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29


Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);   # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);     # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
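The parameter-passing forms listed above can be exercised together in a few lines of plain Perl. This is only a sketch: select_hits, my_filter and the -min and -scores keys are hypothetical names chosen for the illustration, not bioperl parameters.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash reference and a code reference.
sub select_hits {
    my ($params, $filter) = @_;
    return grep { $filter->($_, $params->{-min}) } @{ $params->{-scores} };
}

# The filter passed by reference, in the spirit of -filt_func above.
sub my_filter {
    my ($score, $min) = @_;
    return $score >= $min;
}

# A hash holding a reference on an anonymous array.
my %param = ( -min => 50, -scores => [ 10, 75, 50, 99 ] );

my @kept = select_hits(\%param, \&my_filter);
print "kept: @kept\n";
print join(' ', ref(\%param), ref(\&my_filter), ref($param{-scores})), "\n";
```

The last line shows ref() distinguishing the three kinds of references used in the call.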

6.2.2 Filehandles and streams

• A data stream is for instance an open file. Streams are available through filehandles (file descriptors) - in perl, a symbol without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default to the STDIN and STDOUT descriptors. The filehandle is associated to the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;                 # read a line
    print OUTFILE $line;              # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
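As a small illustration of passing filehandles around, the sketch below hands a typeglob reference and then a lexical filehandle to the same subroutine. write_line is a hypothetical helper, and the in-memory file is used only so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper that writes to whatever filehandle it receives.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword filehandle has to be passed as a reference to its typeglob.
write_line(\*STDOUT, "to standard output");

# A lexical filehandle (here opened on a scalar) is passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($mem, "to a scalar");
close($mem);
```

Both calls go through the same print statement, which is also what lets the -fh parameter of newFh accept \*STDIN or \*STDOUT.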

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std."-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
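The two ideas above, catching an exception with eval and using die to encode an abstract method, can be combined in a short bioperl-free sketch. The My::SeqI, My::Seq and My::Broken packages are hypothetical classes written for this illustration; they use a plain die where bioperl would call throw.

```perl
use strict;
use warnings;

package My::SeqI;               # hypothetical interface class
sub new { my ($class) = @_; return bless {}, $class }
sub seq { my ($self) = @_; die ref($self) . " did not implement seq\n" }

package My::Seq;                # implements the interface
our @ISA = ('My::SeqI');
sub seq { return 'ACGT' }

package My::Broken;             # forgets to implement seq
our @ISA = ('My::SeqI');

package main;

my @messages;
for my $class ('My::Seq', 'My::Broken') {
    my $obj = $class->new;
    my $seq = eval { $obj->seq };       # the ';' closes the eval statement
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "ok: $seq\n";
    }
}
print @messages;
```

The second class inherits the interface's seq, so calling it raises the exception that the eval block then catches.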

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');      # 'f:' takes a value, which is read from $opt_f below
    if ($opt_f) { $format = $opt_f; }
    else { $format = "Genbank"; }
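To see what getopts leaves behind, the following sketch runs the first example on a simulated command line. Setting @ARGV by hand is an assumption made only to keep the example self-contained; the option letters are those of the example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (an assumption, for a self-contained run).
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.seq');

my %opts;
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i format] [-o format] [-I file] [-O file]\n";

# Options that were not given fall back to a default value.
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "in=$inform out=$outform file=$outfile rest=@ARGV\n";
```

A colon after a letter in the option string means that the option takes a value; whatever is not an option stays in @ARGV.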

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

    if ($seq->isa("Bio::Seq::RichSeq")) ...

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
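The tie mechanism can be tried outside bioperl with a toy class. The sketch below assumes a hypothetical My::ListFh class that serves items from a list on read and stores whatever is printed; it is not the Bio::SeqIO implementation, only the same three hooks (TIEHANDLE, READLINE, PRINT).

```perl
use strict;
use warnings;

package My::ListFh;             # hypothetical class, not part of bioperl

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub READLINE { my ($self) = @_; return shift @{ $self->{in} } }
sub PRINT    { my ($self, @args) = @_; push @{ $self->{out} }, @args; return 1 }

package main;
use Symbol;

my $fh  = Symbol::gensym();                         # anonymous glob, as in fh() above
my $obj = tie *$fh, 'My::ListFh', 'seq1', 'seq2';   # tie returns the underlying object

my @read;
while (defined(my $item = <$fh>)) {                 # calls READLINE
    push @read, $item;
    print {$fh} uc($item);                          # calls PRINT
}
print "read: @read, written: @{ $obj->{out} }\n";
```

Replacing the list by a sequence stream and the stored array by a writer gives the behaviour described for Bio::SeqIO filehandles.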


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0; # false
        }

        return 1; # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default: 'swiss')\n";
        print "    -o outformat -- outfile format (default: 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
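The gap-counting step of the solution above can be tried without bioperl. The following is only a minimal sketch on plain strings: the sub name gap_bounds and the toy alignment are assumptions made for illustration and are not part of the course code. It returns the first and last column to keep, which are the two values passed to slice() in the solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): given a gap character and aligned
# strings of equal length, return the first and last column (1-based) that
# remain after removing the leading and trailing gap blocks.
sub gap_bounds {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        # length of the run of gaps at the beginning of this row
        my ($lead) = $str =~ /^(\Q$gap_char\E*)/;
        $startgaps = length($lead) if length($lead) > $startgaps;
        # length of the run of gaps at the end of this row
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $endgaps = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

# toy alignment, invented for the example
my @rows = ('--ACGTAC--',
            '-TACGTACG-',
            'GTACGTAC--');
my ($first, $last) = gap_bounds('-', @rows);
print "keep columns $first to $last\n";
print substr($_, $first - 1, $last - $first + 1), "\n" foreach @rows;
```

With this toy input the longest leading and trailing gap runs are both two columns wide, so columns 3 to 8 are kept.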

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines of solution a)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# set the Expectation value parameter
$factory->e(1.0);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# keep the raw Blast report in a local file
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# length threshold: despite its name, $max_length is used as a minimum
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

# -1 means "no upper bound"
my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the database name from a hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

# databases for which the best hit has already been displayed
my %done = ();

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# one result per query sequence
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;   # only the best hit
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

# the contig is the first sequence of the alignment
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq => $seq->seq,
    -id => $seq->display_id,
    -start => 1,
    -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[1]);

# add one sequence per HSP, padded with gaps to the contig length
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq => $seq,
            -id => $name,
            -start => $start,
            -end => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
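To see what the padding step of this solution produces, here is a small stand-alone sketch. It computes the same padding with the repetition operator instead of two loops; this is an alternative formulation chosen for illustration, not the course code, and the input values are invented.

```perl
use strict;
use warnings;

# Pad an HSP string with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;   # HSP string longer than the remaining columns
    return ('-' x $left) . $seq . ('-' x $right);
}

# an HSP of 4 residues starting at column 3 of a 10-column alignment
print fill_gaps('ACGT', 3, 10), "\n";
```

The padded string has the same length as the contig, which is what allows each HSP to be added to the alignment as a separate row.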

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;   # only the best hit of each database
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag: ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# hits seen in the previous iterations
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
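The bookkeeping of new and disappeared hits between two iterations can be illustrated without a PSI-blast report. The sketch below is not the course code: it swaps the grep-based test for hash lookups, the sub name compare_rounds is hypothetical, and the hit names are invented. It only compares each round with the one just before it.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive rounds and return
# (reference to the new hits, reference to the disappeared hits).
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@disappeared);
}

# invented hit names, one list per iteration
my @rounds = ( [qw(hitA hitB hitC)],
               [qw(hitA hitC hitD)],
               [qw(hitC hitD hitE)] );

my @previous = ();
for my $iteration (1 .. scalar @rounds) {
    my $current = $rounds[$iteration - 1];
    my ($new, $gone) = compare_rounds(\@previous, $current);
    print "iteration $iteration\n";
    print "  NEW hit name: $_\n"         foreach @$new;
    print "  DISAPPEARED hit name: $_\n" foreach @$gone;
    @previous = @$current;
}
```

Using exact hash lookups avoids treating one hit name as matching another that merely contains it, which a pattern match could do.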


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# build one alignment per HSP
while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq => $hsp->qs,
        -id => $query1->display_id,
        -start => 1,
        -end => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq => $hsp->ss,
        -id => $report->sbjctName,
        -start => 1,
        -end => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;
    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
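The entry names come from the identifier extraction done in script (a). The following is a minimal sketch of that step in isolation. It assumes that the raw Genscan identifier has three fields separated by "|" with the prediction name in the middle; this layout, the sample identifier and the sub name short_id are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the middle field of a "name|ID|length" identifier.
# The three-field layout is an assumption about the raw Genscan identifier.
sub short_id {
    my ($raw) = @_;
    my ($id) = $raw =~ /\|(.+)\|/;
    return defined $id ? $id : $raw;   # fall back to the raw id if no match
}

# invented identifier
print short_id('contig1|GENSCAN_predicted_CDS_3|1122_bp'), "\n";
```

The fallback avoids storing an undefined identifier when the pattern does not match, a case that script (a) does not handle.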

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS
#          + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if ( ! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with their sequence format
    %AVAIL_LOCALDB = ( sprot => 'swiss',
                       sptrnrdb => 'swiss',
                       gb => 'genbank',
                       embl => 'embl' );

    @ISA = qw(Bio::DB::RandomAccessI);
}

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # an entry is given as db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # close the pipe so that the exit status of golden is available in $?
    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;
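The argument checks of get_Seq can be exercised without golden or bioperl. The sketch below repeats them on plain data; the sub name parse_entry is hypothetical and it uses die where the module uses throw, so it only illustrates the db:id convention and the format lookup, not the module itself.

```perl
use strict;
use warnings;

# same table as in the module: database name => sequence format
my %AVAIL_LOCALDB = ( sprot => 'swiss', sptrnrdb => 'swiss',
                      gb => 'genbank', embl => 'embl' );

# Hypothetical helper: split "db:id" and return (db, id, format),
# or die with the same messages as get_Seq.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry);
    die "no database specified\n" unless $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "database $db, entry $id, format $format\n";
```

An entry given without a database prefix, such as TAUD_ECOLI alone, is rejected with "no database specified", as in the module.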

Page 24: Bio Perl

Chapter 2 Sequences

Figure 2.7. Location Classes Structure

2.3.5. Graphical view of features

Version 1.0 of bioperl has a package to generate annotation images of sequences. The general procedure is to create a Bio::Graphics::Panel object that holds the image and the sequence to annotate. You can then add annotations line by line, using the add_track method of the panel object. The Bio::Graphics::Glyph package contains several predefined styles to show annotations, but you can also write your own glyphs.

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Seq.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_1_Manipulating_sequence_da])

$seqobj->seq;              # sequence as string
$seqobj->subseq(10,40);    # sub-sequence as string
$seqobj->trunc(10,100);    # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AA...TAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRV--GAT^ATC',   # creates a new one
                                         -MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file' => 'sigtest.aa',   # not a Bio::Seq object
                                               '-threshold' => '3.5',
                                               '-desc' => 'test sigcleave protein seq',
                                               '-type' => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::AlignIO->newFh ( -fh => \*STDIN, -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters out gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split(//, $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split //, $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
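The three steps of this example (extract the columns, filter them, rebuild the rows) can also be tried on plain strings, without an alignment file. This is only a sketch: the sub name drop_gap_columns and the toy rows are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every column that contains the gap character
# from a list of aligned strings of equal length.
sub drop_gap_columns {
    my ($gap_char, @rows) = @_;

    # 1. transpose the rows into columns (one string per column)
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        $cols[$colnr++] .= $_ foreach split //, $row;
    }

    # 2. keep only the columns without any gap
    my @kept = grep { index($_, $gap_char) < 0 } @cols;

    # 3. transpose back into rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        $out[$rownr++] .= $_ foreach split //, $col;
    }
    return @out;
}

# toy alignment, invented for the example
print "$_\n" foreach drop_gap_columns('-', 'AC-GT', 'T-CGA', 'GCCGT');
```

Here the second and third columns each contain a gap, so only the first, fourth and fifth columns survive in every row.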

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp',
                                                     'database' => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast in the examples above actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4124 Filtering Blast output

Exercise 49 Filtering hits by length

Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14

Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname($hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
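The hash-marking idea in this exercise can be tried in plain Perl before touching a real report. The sketch below is only an illustration: the hit names are made up in the NCBI `db|accession|name` style, and `dbname_sketch` is a hypothetical stand-in for the course's dbname function, assuming the database is the word before the first `|`.

```perl
use strict;
use warnings;

# Hypothetical hit names, assumed to be sorted from best to worst hit.
my @hit_names = ('sp|P12345|AAA_HUMAN', 'sp|P67890|BBB_MOUSE',
                 'pir|A12345|A12345',   'sp|Q11111|CCC_RAT');

# Assumed convention: the database is the word before the first '|'.
sub dbname_sketch {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

my %done;          # databases already seen
my @best_hits;     # first (best) hit kept for each database
foreach my $name (@hit_names) {
    my $db = dbname_sketch($name);
    next if $done{$db};        # a better hit was already kept for this database
    $done{$db} = 1;
    push @best_hits, $name;
}

print "$_\n" foreach @best_hits;
```

Because the hits are visited in report order, the first hit seen for a database is also its best one, so a single pass is enough.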

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP whose identity is above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• you may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for every HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                 -format => 'fasta');

my $query = <$query_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);    # number of PSI-blast iterations
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while (my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'psiblast',
                            -file   => $ARGV[0]);

while (my $result = $in->next_result()) {
    foreach my $hit ($result->hits) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                  -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                  -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# build one two-sequence alignment per HSP
while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format. Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing, with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron: primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr: primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                     -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful to use modules [use] or to reach a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                   -parse     => 1,
                                   -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

• references on an anonymous array:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2);
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
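To see these reference kinds working together outside bioperl, here is a minimal sketch. The function `run_filter`, the `-seqs` key contents and the `long_enough` filter are hypothetical names chosen for the illustration; only core Perl is assumed.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash reference and a code reference.
sub run_filter {
    my ($params, $filter) = @_;
    my @kept = grep { $filter->($_) } @{ $params->{-seqs} };
    return \@kept;                      # result handed back as an array reference
}

# Hypothetical filter: keep strings of at least 3 characters.
sub long_enough { return length($_[0]) >= 3 }

my @seqs     = ('MKV', 'MKVLAAG', 'MK');
my %runParam = (-prog => 'blastp', -seqs => \@seqs);

my $kept = run_filter(\%runParam, \&long_enough);

# ref() tells which kind of reference we hold.
print ref(\%runParam), " ", ref(\&long_enough), " ", ref($kept), "\n";
print join(",", @$kept), "\n";
```

Passing `\%runParam` rather than `%runParam` keeps the hash in one piece: without the backslash, the hash would be flattened into the argument list.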

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open(INFILE, $infile) || die "cannot open $infile: $!";
open(OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh(-file => 'f');       # make a tied filehandle
$sequence = <$in>;                           # read a sequence object
$out = Bio::SeqIO->newFh(-file => '> f');
print $out $sequence;                        # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
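The typeglob-reference idiom can be tried with core Perl alone. In the sketch below, `write_record` is a hypothetical helper; the lexical filehandle opened on a string is an addition that is not part of the course, shown only as the other common way to pass a handle around.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a record to whatever filehandle it receives.
sub write_record {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword handle must be passed as a reference to its typeglob.
write_record(\*STDOUT, "to standard output");

# A lexical filehandle is an ordinary scalar and can be passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, "to a string");
close($mem);

print "buffer holds: $buffer";
```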

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ($@) {
    print "Caught exception";
} else {
    print "no exception";
}
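The die/eval pair can be exercised without bioperl. The sketch below assumes a hypothetical `set_sequence` function that rejects anything that is not made of letters; the message format only imitates the one shown earlier.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on bad input.
sub set_sequence {
    my ($seq) = @_;
    die "MSG: [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return uc $seq;
}

my @log;
foreach my $candidate ('acgt', '==') {
    my $result = eval { set_sequence($candidate) };   # note the ; after the block
    if ($@) {
        push @log, "caught: $@";          # $@ holds the message passed to die
    } else {
        push @log, "ok: $result\n";
    }
}
print @log;
```

After a successful eval, `$@` is empty, which is why the test on `$@` comes immediately after the block.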

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ($self->can('throw')) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
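The same abstract-method pattern can be sketched with made-up classes. `ShapeI`, `Square` and `Blob` are hypothetical names; the interface dies with a plain message instead of using bioperl's throw.

```perl
use strict;
use warnings;

package ShapeI;                       # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub area {
    my ($self) = @_;
    # Reached only when the implementing class did not override area.
    die ref($self) . " did not provide an area method\n";
}

package Square;                       # implements the interface
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;                         # forgets to implement area
our @ISA = ('ShapeI');

package main;

my $square_area = Square->new(side => 3)->area;
my $blob_area   = eval { Blob->new->area };
my $error       = $@;

print "square: $square_area\n";
print "blob: $error";
```

The error only shows up when the missing method is called, not when the class is loaded, so such gaps are found at run time.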

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
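The first example can be run as is if the command line is simulated. The sketch below sets @ARGV by hand (the option values and the extra file name are made up), so that the effect of the `:` after each letter, which marks an option that takes a value, is visible.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, only so that the sketch runs on its own.
local @ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra.txt');

my %opts;
getopts('i:o:I:O:', \%opts) or die "bad options\n";

# Options that were not given fall back to their defaults.
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "in: $infile ($inform), out: $outfile ($outform)\n";
print "remaining arguments: @ARGV\n";
```

getopts removes the options it recognizes from @ARGV, so whatever is left can be treated as ordinary arguments.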

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component.

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
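These class tests can be checked on a toy hierarchy. `Feature` and `Exon` below are hypothetical classes defined only for the illustration.

```perl
use strict;
use warnings;

package Feature;                          # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package Exon;                             # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new;

my $class      = ref($exon);                       # class of the object
my $obj_is_a   = $exon->isa('Feature') ? 1 : 0;    # test on an object
my $class_is_a = Exon->isa('Feature')  ? 1 : 0;    # test on a class name
my $reverse    = Feature->isa('Exon')  ? 1 : 0;    # inheritance is one-way

print "class of exon: $class\n";
print "$obj_is_a $class_is_a $reverse\n";
```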

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
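The timing of a BEGIN block is easy to observe in a single file. In this minimal sketch, the array name and the messages are arbitrary.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time: first statement';

# Written later in the file, yet executed first, while the file is compiled.
BEGIN { push @order, 'compile time: BEGIN block' }

push @order, 'run time: last statement';

print "$_\n" foreach @order;
```

A `use Module` statement behaves like a BEGIN block that loads the module, which is why the module's own BEGIN blocks run before the rest of the calling script.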

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
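The three Exporter variables can be seen at work with a module defined inline. `My::Tools` and all of its symbols are hypothetical; the explicit call to import stands in for the `use My::Tools qw(revcom :obj);` line that one would write if the module lived in its own file.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;                     # hypothetical module, defined inline
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ('version');              # exported by default
    our @EXPORT_OK   = ('revcom', '$Default');   # imported on demand
    our %EXPORT_TAGS = (obj => ['$Default']);    # a named set of symbols
    our $Default     = 'dna';
    sub version { return '0.1' }
    sub revcom  { my $s = reverse $_[0]; $s =~ tr/acgt/tgca/; return $s }
}

# Stands in for "use My::Tools qw(revcom :obj);".
BEGIN { My::Tools->import('revcom', ':obj') }

our $Default;    # now an alias of $My::Tools::Default
print revcom('aacg'), " $Default\n";

# With an explicit import list, the default exports are not imported.
print defined(&main::version) ? "version imported\n" : "version not imported\n";
```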

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;          # make a tied filehandle
$sequence = <$fh>;       # read a sequence object (calls READLINE)
print $fh $sequence;     # write a sequence object (calls PRINT)
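The mechanism can be reproduced with a small class of our own. `CountingStream` below is hypothetical and only imitates the shape of the Bio::SeqIO code: it serves strings from a list instead of sequence objects, and records what is printed to it.

```perl
use strict;
use warnings;
use Symbol;

package CountingStream;     # hypothetical class standing in for a SeqIO-like stream

sub new {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}

# Build a filehandle tied to this object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym;
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie: the existing object itself handles the filehandle.
sub TIEHANDLE { my ($class, $self) = @_; return $self }

# <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    return splice @{ $self->{items} };
}

# print $fh ...: record what was written.
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $stream = CountingStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;          # calls READLINE in scalar context
my @rest   = <$fh>;          # calls READLINE in list context
print $fh $first;            # calls PRINT
print "first=$first rest=@rest written=@{ $stream->{written} }\n";
```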

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. An universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # no output file given: derive its name from the input file
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;    # false
    }
    return 1;        # true
}

sub usage {
    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ($annotation->each_DBLink()) {
    if ($link->database() eq 'PDB') {
        push(@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function -- exo
my $function = "";
foreach my $comment ($annotation->get_Annotations('comment')) {
    my @lines = split(/\n/, $comment->text());
    foreach my $line (@lines) {
        if ($line =~ s/^-!- FUNCTION: //) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ($#ARGV > -1) {
    $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
} else {
    $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
}
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
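The gap-counting step of this solution can be checked on plain strings, without an alignment object. The three rows below are made-up aligned sequences, and `substr` plays the role of the slice method; this is only a sketch of the logic.

```perl
use strict;
use warnings;

# Hypothetical aligned sequences (same length, '-' as gap character).
my @rows     = ('--ACGTAC--', '-TACGTACG-', '---CGTACGT');
my $gap_char = '-';

my ($startgaps, $endgaps) = (0, 0);
foreach my $str (@rows) {
    my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;   # run of gaps at the start
    my ($trail) = $str =~ /(\Q$gap_char\E*)$/;   # run of gaps at the end
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no sequence has a leading or trailing gap.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "$_\n" foreach @trimmed;
```

Note that substr counts columns from 0 whereas the slice method of the alignment counts from 1, hence the `$startgaps+1` in the solution.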

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                             -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while (my @rids = $factory->each_rid) {
    print STDERR "waiting\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid (@rids) {
        my $rc = $factory->retrieve_blast($rid);
        if (!ref($rc)) {
            if ($rc < 0) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while (my $hit = $result->next_hit) {
                print "hit name is: ", $hit->name, "\n";
                while (my $hsp = $hit->next_hsp) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the database name: the word before the first | of the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;    # databases whose best hit has already been displayed
while (my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                              -format => 'fasta');
push(@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                              -format => 'fasta');
push(@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only the first (best) hit of each result
    }
}

72

Appendix A Solutions

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
                     -seq   => $seq->seq,
                     -id    => $seq->display_id,
                     -start => 1,
                     -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
                            -seq   => $seq,
                            -id    => $name,
                            -start => $start,
                            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
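To see what the padding helper of this solution produces, here is a minimal standalone sketch in plain Perl (no bioperl required). The sub follows the same logic as fill_gaps above, written with the repetition operator instead of loops; the sample sequence, start position and alignment width are made-up values chosen for illustration.

```perl
use strict;
use warnings;

# Pad a sequence with '-' so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "-" x ($start - 1);            # leading gaps
    $gapped_seq .= $seq;
    my $trailing = $length - length($gapped_seq);   # columns still to fill
    $gapped_seq .= "-" x $trailing if $trailing > 0;
    return $gapped_seq;
}

# Hypothetical values: a 4-letter HSP placed at position 3
# of a 10-column alignment
my $padded = fill_gaps("ACGT", 3, 10);
print "$padded\n";    # prints --ACGT----
```

Every padded string then has the same length as the contig, which is what an alignment object needs before its rows can be printed as one block.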

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the best hit of each database
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();    # hits of the previous iteration

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    # remember the hits of this iteration only, so that a hit is not
    # reported as disappeared again in later iterations
    @previous_hitarray = ();
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->qs,
                       -id    => $query1->display_id,
                       -start => 1,
                       -end   => $query1->length
                   );
    my $sbjctSeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->ss,
                       -id    => $report->sbjctName,
                       -start => 1,
                       -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database
# - 1 entry per gene with genomic DNA
# + 1 feature for the whole CDS
# + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with their sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # an entry is given as "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    # a non-zero exit status of golden means that the entry was not found
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;    # a module must return a true value
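The module above reads the output of an external program through a pipe and then inspects the exit status of that program. The following standalone sketch illustrates that mechanism in plain Perl. Since golden may not be installed, the running perl interpreter ($^X) is used as a stand-in command; the helper name run_and_capture is made up for this example, and the list form of the piped open assumes a Unix-like system.

```perl
use strict;
use warnings;

# Run a command, read its standard output through a pipe, and return
# the captured text together with the command's exit code.
sub run_and_capture {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "cannot start @cmd: $!";
    my $output = do { local $/; <$fh> };    # slurp everything
    $output = '' unless defined $output;
    close($fh);                             # waits for the child and sets $?
    return ($output, $? >> 8);              # high byte of $? is the exit code
}

# Stand-in for "golden db:id": a child perl that prints a line and exits with 3
my ($text, $status) = run_and_capture($^X, '-e', 'print "hello\n"; exit 3');
print "output: $text";
print "exit code: $status\n";
```

The exit code is only reliable after the pipe has been closed, which is why the check on $? comes after the input stream has been released.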

Page 25: Bio Perl

Chapter 2 Sequences

Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codesfeature_graphpl] is the code that generates the image. The helix image and the arrow of domains are added glyph types that can be found here [modules].

Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA

2.4. Sequence analysis tools

Bio::Seq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])

$seqobj->seq;                # sequence as string
$seqobj->subseq(10,40);      # sub-sequence as string
$seqobj->trunc(10,100);      # sub-sequence as fresh Bio::PrimarySeq
$seqobj->translate;
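The coordinates passed to subseq and trunc are 1-based and inclusive, whereas Perl's own substr takes a 0-based offset and a length. The following plain-Perl sketch shows the conversion on an ordinary string; the helper name subseq_str and the sample sequence are made up for illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Return the sub-sequence from $start to $end (1-based, both inclusive),
# the coordinate convention used by $seqobj->subseq.
sub subseq_str {
    my ($seq, $start, $end) = @_;
    die "bad coordinates $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = "ATGGCCATTGTAATGGGCCGC";      # made-up sequence, 21 bases
print subseq_str($dna, 4, 9), "\n";     # bases 4 through 9: prints GCCATT
```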

Bio::Tools::SeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])

$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$seq_stats->count_monomers();
$seq_stats->count_codons();
$weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]

$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);

Bio::Tools::SeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])

$pat = 'T[GA]AATAAT';
$pattern = new Bio::Tools::SeqPattern(-SEQ =>$pat, -TYPE =>'Dna');
$pattern->expand;
$pattern->revcom;
$pattern->alphabet_ok;

Bio::SeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]

$util = new Bio::SeqUtils;
$polypeptide_3char = $util->seq3($seqobj);

or

$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])

$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');
@fragments = $re1->cut_seq($seqobj);
$re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>'EcoRV--GAT^ATC',    # creates a new one
                                          -MAKE =>'custom');

Bio::Tools::Sigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file'      =>'sigtest.aa',    # not a Bio::Seq object
                                               '-threshold' =>'3.5',
                                               '-desc'      =>'test sigcleave protein seq',
                                               '-type'      =>'AMINO ');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;


Chapter 3 Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [datainfilealn])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh ( -fh => \*STDIN,  -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
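The column filtering of this example does not depend on bioperl objects: it is a transposition, a filter, and a transposition back. The sketch below shows the same three steps on plain strings, which may help when testing the logic without an alignment file. The sub name remove_gap_columns and the three sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Remove every alignment column that contains the gap character.
# Takes the gap character and a list of equal-length aligned strings,
# returns the filtered strings in the same order.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;
    return () unless @rows;

    # 1. transpose: one string per column
    my @cols;
    foreach my $row (@rows) {
        my @chars = split //, $row;
        for my $colnr (0 .. $#chars) {
            $cols[$colnr] .= $chars[$colnr];
        }
    }

    # 2. keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # 3. transpose back: one string per sequence
    my @filtered = ('') x @rows;
    foreach my $col (@kept) {
        my @chars = split //, $col;
        for my $rownr (0 .. $#chars) {
            $filtered[$rownr] .= $chars[$rownr];
        }
    }
    return @filtered;
}

# Made-up three-sequence alignment; columns 2 and 3 contain a gap
my @aln = ("AC-GT", "ACTGT", "A-TGT");
print "$_\n" foreach remove_gap_columns("-", @aln);    # prints AGT three times
```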

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [datainfilealn]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_codeprotal2dnahtml] (download)

• bioperl classes:

  • Bio::AlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]

  • Bio::SimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]

  • Bio::Tools::CodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]

• Web form [httpbiowebseqanalinterfacesprotal2dnahtml]

Chapter 4 Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data1protfasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data1protfasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                     'database'  => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12
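As a reminder of what ref() returns, here is a small standalone sketch in plain Perl. The package name My::Report is made up for this example; it only stands in for any class whose objects are blessed references.

```perl
use strict;
use warnings;

# A tiny stand-in class; the package name is made up for this example.
package My::Report;
sub new { my ($class) = @_; return bless {}, $class; }

package main;

my $report = My::Report->new();
print ref($report), "\n";        # class name of a blessed object: My::Report
print ref([1, 2, 3]), "\n";      # plain array reference: ARRAY
print ref({}), "\n";             # plain hash reference: HASH
print ref("text") eq '' ? "not a reference\n" : "reference\n";
```

Applied to an object returned by a bioperl method, the same call gives the name of the class in which to look up the documentation.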

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [datablastout]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout]).

• If necessary, you can find here a dbname(hit->name) [solutiondbnamepl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
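The two hints of this exercise can be tried out without a Blast report. The sketch below works on a plain list of hit names: it extracts the database tag in front of the first '|' and uses a hash to skip databases that have already been handled. The sub names, the anchored pattern and the sample names are assumptions made for illustration; the dbname function provided with the course may differ.

```perl
use strict;
use warnings;

# Extract the database tag from an identifier such as "sp|P00001|AAA_HUMAN".
# Returns undef when the name has no "db|" prefix.
sub dbname {
    my ($name) = @_;
    my ($db) = $name =~ /^(\w+)\|/;
    return $db;
}

# Keep only the first (best) hit name seen for each database.
sub best_per_db {
    my (@hit_names) = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $db = dbname($name);
        next unless defined $db;
        next if $done{$db};       # database already handled
        $done{$db} = 1;           # mark it as done
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, in the order a report lists them (best first)
my @hits = ("sp|P00001|AAA_HUMAN", "gb|X00002|",
            "sp|P00003|BBB_MOUSE", "gb|X00004|");
print "$_\n" foreach best_per_db(@hits);
```

Because a Blast report lists hits from best to worst, the first hit met for a database is also its best one, so a single pass with the hash is enough.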

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [dataP42704fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract as a string the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [dataest] in this contig [datacontig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [datablast-estout]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family of parsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml], Bio::Tools::BPpsilite [httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml], Bio::Tools::BPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the Bio::Tools::BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class. That is why the methods used to get information are different, namely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP [httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class Bio::Tools::BPlite::Sbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml], instead of next_hit and next_hsp. There is no next_result method, since these parsers are not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or the tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);    # number of iterations
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        # nextHSP returns one HSP per call, so loop until it is exhausted
        while( my $hsp = $hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [datapsiblastout]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
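The comparison asked for in this exercise is a set difference between the hit names of two consecutive iterations. Here is a minimal sketch of that step in plain Perl, independent of any parser; the sub name compare_rounds and the hit names are made up for illustration.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive iterations.
# Returns two array references: (new hits, disappeared hits).
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@disappeared);
}

# Made-up hit names for three iterations
my @rounds = (
    [qw(HIT_A HIT_B)],
    [qw(HIT_A HIT_C)],
    [qw(HIT_C HIT_D)],
);

my @previous = ();
for my $i (0 .. $#rounds) {
    my ($new, $gone) = compare_rounds(\@previous, $rounds[$i]);
    print "iteration ", $i + 1, "\n";
    print "  new: @$new\n";
    print "  disappeared: @$gone\n";
    @previous = @{ $rounds[$i] };    # only the last iteration is remembered
}
```

Looking names up in a hash gives exact matches, whereas a pattern match over the list would also accept names that merely contain the searched name.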

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->qs,
                       -id    => $query1->display_id,
                       -start => 1,
                       -end   => $query1->length
                   );
    my $sbjctSeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->ss,
                       -id    => $report->sbjctName,
                       -start => 1,
                       -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data: Genscan output file [datagenscanout]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [datagenscan-cdsout])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron: primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr: primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                         -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                       -parse     => 1,
                                       -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => \@seqs);     # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
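The reference forms listed above can be tried without bioperl. The following is a minimal sketch in plain Perl: run_search, keep_long, %run_param and @results are hypothetical names that only stand in for the Bio::Tools::Blast parameters, and no bioperl call is made.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a constructor that takes named parameters.
sub run_search {
    my %args   = @_;
    my $param  = $args{-run};          # reference to a hash
    my $filter = $args{-filt_func};    # reference to a function
    my $save   = $args{-save_array};   # reference to an array (output parameter)
    foreach my $seq (@{ $param->{-seqs} }) {
        push @$save, $seq if $filter->($seq);
    }
    return scalar @$save;
}

# Hypothetical filter: keep sequences of at least 4 residues.
sub keep_long { my ($seq) = @_; return length($seq) >= 4; }

my @seqs      = ('MKV', 'MASTEG', 'ACGT');
my %run_param = (-prog => 'blastp', -seqs => \@seqs);
my @results;

my $count = run_search(-run        => \%run_param,
                       -filt_func  => \&keep_long,
                       -save_array => \@results);

print "kept $count: @results\n";
print ref(\%run_param), " ", ref(\@seqs), " ", ref(\&keep_long), "\n";
```

The caller's @results array is filled through the reference, which is what makes it usable as an output parameter.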

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;           # read one line
    print OUTFILE $sequence;    # write to the file

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');     # make a tied filehandle
    $sequence = <$in>;                          # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '>f');
    print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
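As an illustration of passing filehandles around, here is a small sketch in plain Perl. The function write_fasta is a hypothetical helper, not a bioperl method, and the in-memory file assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: write one FASTA record to whatever filehandle is passed in.
sub write_fasta {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword filehandle must be passed as a reference to its typeglob.
write_fasta(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle (here opened on an in-memory string) is passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_fasta($out, 'seq2', 'TTGA');
close($out);

# Read the record back, one line per list element.
open(my $in, '<', \$buffer) or die "cannot reopen in-memory file: $!";
my @lines = <$in>;
close($in);
chomp @lines;
print "read back: @lines\n";
```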

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
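The two ideas above (catching with eval, and abstract methods that die) can be combined in a small self-contained sketch. The classes Shape, Square and Blob are hypothetical and use plain die rather than bioperl's throw.

```perl
use strict;
use warnings;

package Shape;    # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class; }
sub area {
    my ($self) = @_;
    die "Shape::area - implementing class did not provide this method\n";
}

package Square;   # hypothetical subclass that implements area
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;     # hypothetical subclass that forgets to implement area
our @ISA = ('Shape');

package main;

my @report;
foreach my $shape (Square->new(side => 3), Blob->new()) {
    my $area = eval { $shape->area() };   # eval returns the value of its block
    if ($@) {
        push @report, ref($shape) . ": caught exception";
    } else {
        push @report, ref($shape) . ": area $area";
    }
}
print "$_\n" for @report;
```

Calling area on a Blob reaches the inherited Shape::area, which dies; the eval block turns that into a value in $@ instead of ending the program.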

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
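To see what getopts does with an argument list, the first example can be run on a simulated command line. This is only a sketch: the values placed in @ARGV are made up for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line; in a real script @ARGV is filled by the shell.
@ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra.txt');

my %opts;
# A colon after a letter means that the option takes a value.
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file]\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "in: $infile ($inform), out: $outfile ($outform)\n";
print "remaining arguments: @ARGV\n";
```

Options that were recognized are removed from @ARGV, so only the non-option arguments remain afterwards.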

6.2.5 Classes

• Inheritance example, in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
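The is-a tests and ref() can be checked on a toy class hierarchy. In this sketch, Feature and Exon are hypothetical classes defined inline; they are not the bioperl classes of similar name.

```perl
use strict;
use warnings;

package Feature;                 # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package Exon;                    # hypothetical subclass, inherits new from Feature
our @ISA = ('Feature');

package main;

my $exon = Exon->new();

print "class of exon: ", ref($exon), "\n";
print "an Exon object is a Feature\n" if $exon->isa('Feature');
print "the Exon class is a Feature\n" if Exon->isa('Feature');
print "a Feature is not an Exon\n"    unless Feature->isa('Exon');
```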

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
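The execution order of a BEGIN block can be observed with a short sketch. The variable names are hypothetical, and a plain string stands in for the coordinate policy object.

```perl
use strict;
use warnings;

our @order;
our $default_policy;

push @order, 'main program';

# Although it appears after the push above, this block runs first,
# at compile time, before any ordinary statement of the file.
BEGIN {
    $default_policy = 'widest';
    push @order, 'BEGIN block';
}

print "default policy: $default_policy\n";
print "execution order: ", join(' -> ', @order), "\n";
```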

6.3 Perl reminders for a more advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj (here $Blast, a static Bio::Tools::Blast object, for a restricted use of the module methods)
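The three Exporter variables can be tried with a module defined inline. In this sketch, My::SeqUtil and its functions are hypothetical; a real module would live in its own file, and the %INC line only exists to let use find the inline package.

```perl
use strict;
use warnings;

# Hypothetical module defined inline; normally it would live in My/SeqUtil.pm.
BEGIN {
    package My::SeqUtil;
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(seq_length);               # exported by default
    our @EXPORT_OK   = qw(revcomp gc_count);         # imported on demand
    our %EXPORT_TAGS = (dna => [qw(revcomp gc_count)]);

    sub seq_length { return length($_[0]); }
    sub revcomp    { my $s = reverse $_[0]; $s =~ tr/ACGTacgt/TGCAtgca/; return $s; }
    sub gc_count   { return ($_[0] =~ tr/GCgc//); }

    $INC{'My/SeqUtil.pm'} = 1;   # tell 'use' that the module is already loaded
}

use My::SeqUtil;             # imports seq_length (default export)
use My::SeqUtil qw(:dna);    # imports the symbols tagged 'dna'

my $summary = join(' ', seq_length('AACG'), revcomp('AACG'), gc_count('AACG'));
print "$summary\n";
```

Two use statements are needed here because an explicit import list replaces the default exports instead of adding to them.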

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
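The same pattern can be reproduced without bioperl. The sketch below is loosely modelled on the code above: RecordStream, next_record and write_record are hypothetical names, and the records are plain strings instead of sequence objects.

```perl
use strict;
use warnings;
use Symbol;

package RecordStream;   # hypothetical class wrapping a list of records

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_record  { my $self = shift; return shift @{ $self->{in} }; }
sub write_record { my $self = shift; push @{ $self->{out} }, @_; return 1; }

# Build a filehandle tied to this class, wrapping the object itself.
sub fh {
    my $self  = shift;
    my $class = ref($self) || $self;
    my $s     = Symbol::gensym;
    tie *$s, $class, $self;
    return $s;
}

# Called by tie: keep the wrapped object in a small wrapper.
sub TIEHANDLE { my ($class, $obj) = @_; return bless { stream => $obj }, $class; }

sub READLINE {
    my $self = shift;
    return $self->{stream}->next_record() unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_record());
    return @list;
}

sub PRINT { my $self = shift; return $self->{stream}->write_record(@_); }

package main;

my $stream = RecordStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;

my $first = <$fh>;      # calls READLINE in scalar context
my @rest  = <$fh>;      # calls READLINE in list context
print $fh $first;       # calls PRINT

print "first: $first, rest: @rest, written: @{ $stream->{out} }\n";
```

READLINE uses wantarray to return either the next record or all remaining ones, which is why the same handle works in both scalar and list context.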

Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= "." . $outform;
    }

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file   => $infile,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh     => \*STDIN,
                                -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split(/\n/, $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (! ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                                  -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_" . $i;
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@{$hitarray_ref}) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @{$oldhitarray_ref}) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @{$oldhitarray_ref});
        push(@previous_hitarray, @{$hitarray_ref});
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(
                -primary  => 'CDS',
                -location => $newloc
            );

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(
            -seq     => $seqstr,
            -id      => $ids[1],
            -moltype => 'dna'
        );

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = (
            sprot    => 'swiss',
            sptrnrdb => 'swiss',
            gb       => 'genbank',
            embl     => 'embl'
        );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as "database:identifier"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ''; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        # read the entry from a pipe opened on the golden program
        my $in = Bio::SeqIO->new(
            -file   => "golden $access $entry 2> /dev/null |",
            -format => _get_seq_format($db)
        );

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

  • Bioperl course
    • Chapter 1 General introduction
      • 1.1 Bioperl documentation
      • 1.2 General bioperl classes
    • Chapter 2 Sequences
      • 2.1 The Bio::SeqIO class
      • 2.2 Sequence classes
      • 2.3 Features and Location classes
      • 2.4 Sequence analysis tools
    • Chapter 3 Alignments
      • 3.1 AlignIO
      • 3.2 SimpleAlign
      • 3.3 Code reading: protal2dna
    • Chapter 4 Analysis
      • 4.1 Blast
      • 4.2 Genscan
    • Chapter 5 Databases
      • 5.1 Database classes
      • 5.2 Accessing a local database with golden
    • Chapter 6 Perl Reminders
      • 6.1 UML
      • 6.2 Perl reminders to use bioperl modules
      • 6.3 Perl reminders for a further advanced understanding of bioperl modules
    • Appendix A Solutions
      • A.1 Sequences
      • A.2 Alignments
      • A.3 Analysis
      • A.4 Databases
Page 26: Bio Perl

Chapter 2 Sequences

    $seqobj->translate

Bio::Tools::SeqStats [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_2_Obtaining_basic_sequence])

    $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
    $seq_stats->count_monomers();
    $seq_stats->count_codons();
    $weight = $seq_stats->get_mol_wt($seqobj);

and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]

    $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
    $seq_word->count_words($word_length);

Bio::Tools::SeqPattern [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_5_Miscellaneous_sequence_u])

    $pat = 'T[GA]AATAAT';
    $pattern = new Bio::Tools::SeqPattern(-SEQ => $pat, -TYPE => 'Dna');
    $pattern->expand;
    $pattern->revcom;
    $pattern->alphabet_ok;

Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqUtils.html]

    $util = new Bio::SeqUtils;
    $polypeptide_3char = $util->seq3($seqobj);

or

    $polypeptide_3char = Bio::SeqUtils->seq3($seqobj);

Bio::Tools::RestrictionEnzyme [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_3_Identifying_restriction_])

    $re1 = new Bio::Tools::RestrictionEnzyme(-NAME => 'EcoRI');
    @fragments = $re1->cut_seq($seqobj);
    $re2 = new Bio::Tools::RestrictionEnzyme(
        -NAME => 'EcoRV--GAT^ATC',    # creates a new one
        -MAKE => 'custom'
    );

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

    $sigcleave_object = new Bio::Tools::Sigcleave(
        '-file'      => 'sigtest.aa',    # not a Bio::Seq object
        '-threshold' => '3.5',
        '-desc'      => 'test sigcleave protein seq',
        '-type'      => 'AMINO'
    );
    %raw_results      = $sigcleave_object->signals;
    $formatted_output = $sigcleave_object->pretty_print;

Chapter 3 Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows.

Example 3.1 Format conversions with AlignIO

(data [data/infile.aln])

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $inform  = shift @ARGV || 'clustalw';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1 AlignIO Classes diagram

3.2 SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2 Basic methods of SimpleAlign

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2 Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3 Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ($aln->each_alphabetically()) {
        my $colnr = 0;
        foreach my $chr (split("", $seq->seq())) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2.a we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr (split "", $col) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2.b we replace the old sequence strings with the new ones

    foreach my $seq ($aln->each_alphabetically()) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
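The column-filtering logic of the gap-column example can be tried without bioperl. The following is a minimal sketch on plain strings; the function name remove_gap_columns and the three sample rows are assumptions made for illustration, not part of the course material.

```perl
use strict;
use warnings;

# Remove every alignment column that contains a gap character.
# $rows is a reference to a list of aligned strings of equal length
# (plain strings, not bioperl objects); $gapchar defaults to '-'.
sub remove_gap_columns {
    my ($rows, $gapchar) = @_;
    $gapchar = '-' unless defined $gapchar;

    # 1. transpose the rows into column strings
    my @cols;
    foreach my $row (@$rows) {
        my $colnr = 0;
        foreach my $chr (split //, $row) {
            $cols[$colnr++] .= $chr;
        }
    }

    # 2. keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # 3. transpose back into one string per sequence
    my @out = ('') x @$rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        foreach my $chr (split //, $col) {
            $out[$rownr++] .= $chr;
        }
    }
    return @out;
}

# made-up sample rows: columns 2 and 3 contain a gap
my @filtered = remove_gap_columns([ 'AC-GT', 'A-CGT', 'ACCGT' ]);
print join("\n", @filtered), "\n";
```

The three steps mirror the example: build the columns, filter them, then rebuild the sequence strings. With SimpleAlign objects the rebuilt strings would be written back with $seq->seq(...), as shown in the example.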

Exercise 3.1 Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1 StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1 Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2 Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3 Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4 Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html])

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html])

    • print some hit statistics

Example 4.2 StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in  = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query   = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result       = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5 Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6 Class of a Blast report

What is the class of $blast_report (hint: use the ref() perl function)?
Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is created by the Bio::Tools::Run::StandAloneBlast new method actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3 Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast', '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7 Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8 Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1 Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9 Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10 Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
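The hash-marking hint can be sketched in plain perl, without bioperl. This is only an illustration: the dbname function below is a hypothetical stand-in for the one provided with the course (it assumes the database tag is the text before the first '|'), and the hit names are made-up placeholders, not real entries.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the course's dbname function: take the
# database tag of a hit name such as "sp|ID1|NAME_A" (text before '|').
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Keep the first hit seen for each database. Hits are assumed to be
# listed from best to worst, as in a Blast report.
sub best_hit_by_db {
    my @hit_names = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $database = dbname($name);
        next if $done{$database};    # this database is already done
        $done{$database} = 1;
        push @best, $name;
    }
    return @best;
}

# made-up hit names, for illustration only
print "$_\n" for best_hit_by_db(
    'sp|ID1|NAME_A', 'sp|ID2|NAME_B', 'gb|ID3|NAME_C',
    'sp|ID4|NAME_D', 'emb|ID5|NAME_E', 'gb|ID6|NAME_F',
);
```

With a real report, the loop body would run inside while (my $hit = $result->next_hit()) and use $hit->name() instead of the placeholder strings.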

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.
Solution A.19

Exercise 4.15 Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot'
    );

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17 Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');

    my $query   = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'psiblast', -file => $ARGV[0]);

    while (my $result = $in->next_result()) {
        foreach my $hit ($result->hits) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
    my $query1    = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
    my $query2    = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.
Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding: ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(
        -filename   => $Index_File_Name,
        -write_flag => 1
    );
    $inx->make_index(@ARGV);

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may help you to use the modules [use] or to reach a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new(
        -run       => \%runParam,
        -parse     => 1,
        -filt_func => \&my_filter
    );

• for an output parameter:

    %blastParam = (
        -run        => \%runParam,
        -save_array => \@blast_objs
    );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = (
        -method   => 'remote',
        -prog     => 'blastp',
        -database => 'swissprot',
        -seqs     => [ $seq ]    # Bio::Seq.pm objects
    );

  which is equivalent to the following, with a variable:

    %runParam = (
        -method   => 'remote',
        -prog     => 'blastp',
        -database => 'swissprot',
        -seqs     => \@seqs      # Bio::Seq.pm objects
    );

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
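The three uses of references listed above (a hash as parameter, a function as parameter, an array as output parameter) can be combined in one small plain-perl sketch. The names select_keys, my_filter and %scores are assumptions chosen for illustration; nothing here is a bioperl call.

```perl
use strict;
use warnings;

# Apply a filter function (code reference) to the values of a hash
# (hash reference) and collect the accepted keys in an output array
# (array reference used as an output parameter).
sub select_keys {
    my ($param, $filter, $result) = @_;
    foreach my $key (sort keys %$param) {
        push @$result, $key if $filter->($param->{$key});
    }
    return scalar @$result;    # number of keys collected so far
}

# the filter: accept values greater than 1
sub my_filter { return $_[0] > 1 }

my %scores = (a => 1, b => 2, c => 3);
my @kept;
my $n = select_keys(\%scores, \&my_filter, \@kept);

print "kept $n keys: @kept\n";
print "types: ", ref(\%scores), " ", ref(\&my_filter), " ", ref([1, 2]), "\n";
```

The caller keeps ownership of @kept: the function only receives a reference to it and fills it in place, which is how the -save_array parameter above is meant to work.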

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $sequence;     # write a sequence string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');      # make a tied filehandle
    $sequence = <$in>;                          # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '> f');
    print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
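As an illustration of passing a filehandle as a parameter, here is a small plain-perl sketch. The function name write_records is an assumption; the second call uses a lexical filehandle opened on a string (an in-memory file), which is standard perl since 5.8 and not something the course text itself describes.

```perl
use strict;
use warnings;

# Write one line per record to whatever filehandle is passed in.
# The handle arrives as a reference (\*STDOUT or a lexical handle).
sub write_records {
    my ($fh, @records) = @_;
    print {$fh} "$_\n" for @records;
    return scalar @records;
}

# a bareword filehandle must be passed as a typeglob reference
write_records(\*STDOUT, 'to standard output');

# a lexical filehandle is already a reference; here it writes to a string
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_records($out, 'seq1', 'seq2');
close($out);
print "buffer holds: $buffer";
```

Inside the function the handle is used as print {$fh} ..., which works the same way for both kinds of handle.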

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
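The combination of an abstract method and eval can be tried in plain perl. The classes Shape, Square and Blob below are hypothetical and use die directly instead of bioperl's throw; this is a sketch of the pattern, not bioperl code.

```perl
use strict;
use warnings;

package Shape;    # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless { %args }, $class }
# "abstract" method: dies unless a subclass overrides it
sub area { my ($self) = @_; die ref($self) . " did not implement area\n" }

package Square;   # implements the interface
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;     # forgets to implement area
our @ISA = ('Shape');

package main;

# Return a line with the area, or with the caught error message.
sub try_area {
    my ($shape) = @_;
    my $area = eval { $shape->area() };   # note the ';' after the block
    return $@ ? "Caught exception: $@" : "area = $area\n";
}

print try_area(Square->new(side => 3));
print try_area(Blob->new());
```

The call on the Blob object reaches the inherited area method of Shape, which dies; eval traps the error and leaves the message in $@.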

54

Chapter 6 Perl Reminders

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
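To experiment with option strings without typing a command line, getopts can be run against a local copy of @ARGV. This is a small sketch; parse_args and the chosen option letters are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a hypothetical command line: -i and -o take a value (hence the
# colon after the letter), -h is a simple flag.
sub parse_args {
    local @ARGV = @_;                 # getopts always works on @ARGV
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    return {
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => $opts{h} ? 1 : 0,
        rest    => [@ARGV],           # what is left after the options
    };
}

my $cfg = parse_args('-i', 'embl', 'seqs.embl');
print "in=$cfg->{inform} out=$cfg->{outform} rest=@{ $cfg->{rest} }\n";
```

A letter followed by a colon consumes the next argument as its value; a letter without a colon is a boolean switch.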

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component


• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { }

if ($seq->isa("Bio::Seq::RichSeq")) { }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
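Both tests can be tried on a toy hierarchy. Feature and Exon below are hypothetical stand-ins, not the bioperl classes of similar name.

```perl
use strict;
use warnings;

{
    package Feature;                  # hypothetical base class
    sub new { my ($class) = @_; return bless {}, $class; }
}
{
    package Exon;                     # hypothetical subclass
    our @ISA = ('Feature');
}

my $exon = Exon->new;

print "class of exon: ", ref($exon), "\n";
print "the Exon class is-a Feature\n"  if Exon->isa('Feature');
print "the exon object is-a Feature\n" if $exon->isa('Feature');
print "the exon object is not a Sequence\n" unless $exon->isa('Sequence');
```

isa follows the @ISA chain, so it answers true for the class itself and for every ancestor, whereas ref only returns the class the object was blessed into.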

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(): symbols exported by default

• @EXPORT_OK = qw(): symbols imported on demand

• %EXPORT_TAGS = (tag => []): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
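The three export lists can be seen at work in a single file. This is a sketch under stated assumptions: My::Tools, revcomp, gc_count and $Default are hypothetical names, and because everything lives in one file the import is triggered by hand in a BEGIN block instead of a use statement.

```perl
use strict;
use warnings;

{
    package My::Tools;                          # hypothetical module
    use Exporter 'import';
    our (@EXPORT, @EXPORT_OK, %EXPORT_TAGS);
    our $Default = 'ACGT';
    BEGIN {
        @EXPORT      = qw(revcomp);             # exported by default
        @EXPORT_OK   = qw(gc_count $Default);   # imported on demand
        %EXPORT_TAGS = (obj => [qw($Default)]); # a named set of symbols
    }
    sub revcomp  { my $s = reverse $_[0]; $s =~ tr/ACGTacgt/TGCAtgca/; return $s; }
    sub gc_count { my $s = $_[0]; return ($s =~ tr/GCgc//); }
}

# In a separate file this would read: use My::Tools qw(:DEFAULT :obj);
BEGIN { My::Tools->import(qw(:DEFAULT :obj)) }

print revcomp('AAGC'), "\n";                    # imported by default
print "$Default\n";                             # imported through the :obj tag
print My::Tools::gc_count('AAGC'), "\n";        # not imported: full name needed
```

Symbols that are not imported remain reachable through their fully qualified name, which is what the remark about the absence of private variables refers to.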

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;        # make a tied filehandle
$sequence = <$fh>;     # read a sequence object (calls READLINE)
print $fh $sequence;   # write a sequence object (calls PRINT)
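A tied filehandle can be tried with a toy class. UpperCaseHandle is a hypothetical name; it reads from an in-memory list and stores what is printed, which is enough to see when READLINE and PRINT are called.

```perl
use strict;
use warnings;
use Symbol;

{
    package UpperCaseHandle;          # hypothetical tied-handle class
    sub TIEHANDLE {
        my ($class, @items) = @_;
        return bless { in => [@items], out => [] }, $class;
    }
    # Called by <$fh>: hand out the next stored item.
    sub READLINE { my $self = shift; return shift @{ $self->{in} }; }
    # Called by print $fh ...: store an upper-cased copy.
    sub PRINT    { my $self = shift; push @{ $self->{out} }, map { uc } @_; return 1; }
}

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'UpperCaseHandle', 'acgt', 'ttga';

my $first = <$fh>;                    # calls READLINE
print $fh $first;                     # calls PRINT
print "stored: @{ $obj->{out} }\n";
```

The object returned by tie is the one that receives the method calls, so it can be kept to inspect the state behind the handle.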

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "-i informat -- infile format (default 'swiss')\n";
    print "-o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "-h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
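The column arithmetic of this solution can be checked on plain strings, without any alignment object. In this sketch trim_end_gaps is a hypothetical helper and the three rows are made-up data; it keeps only the block of columns in which no sequence has a leading or trailing gap.

```perl
use strict;
use warnings;

# Trim the columns where at least one row still has a leading or trailing gap.
sub trim_end_gaps {
    my ($gap, @rows) = @_;
    my ($lead, $trail) = (0, 0);
    for my $row (@rows) {
        my ($l) = $row =~ /^(\Q$gap\E*)/;     # run of gaps at the start
        my ($t) = $row =~ /(\Q$gap\E*)$/;     # run of gaps at the end
        $lead  = length($l) if length($l) > $lead;
        $trail = length($t) if length($t) > $trail;
    }
    my $width = length($rows[0]) - $lead - $trail;
    $width = 0 if $width < 0;
    return map { substr($_, $lead, $width) } @rows;
}

my @trimmed = trim_end_gaps('-', '--ACGTAC-', 'TTAC-TACG', '-TACGTA--');
print "$_\n" for @trimmed;
```

Internal gaps are left untouched: only the longest leading run and the longest trailing run decide where the alignment is cut, as in the slice call of the solution.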

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
# (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
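The "first hit per databank" logic does not depend on the Blast parser and can be tried on a list of names. In this sketch best_hit_by_databank is a hypothetical helper and the hit names are invented; it assumes, as the solution does, that the databank is the prefix before the first | of the hit name and that hits arrive best first.

```perl
use strict;
use warnings;

# Keep only the first (best) hit name seen for each databank prefix.
sub best_hit_by_databank {
    my @hit_names = @_;
    my (%done, @best);
    for my $name (@hit_names) {
        my ($db) = $name =~ /^(\w+)\|/ or next;   # skip names without a prefix
        next if $done{$db}++;                     # databank already reported
        push @best, $name;
    }
    return @best;
}

# Invented hit names, in the order a report would list them.
my @best = best_hit_by_databank(
    'sp|X00001|AAA_TEST',
    'sp|X00002|BBB_TEST',
    'tr|X00003|CCC_TEST',
    'no_prefix_here',
);
print "$_\n" for @best;
```

The hash plays the role of %done in the solution: a databank is reported once, and every later hit from it is skipped.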

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq => $seq->seq,
    -id => $seq->display_id,
    -start => 1,
    -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file' => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq => $seq,
            -id => $name,
            -start => $start,
            -end => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
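The padding done by fill_gaps can be checked in isolation. The version below is an equivalent sketch written with the repetition operator x instead of loops; the sequence and coordinates are made-up values.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # sequence runs up to or past the end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
```

Every HSP string padded this way has the same length as the contig, which is what allows it to be added as a row of the same SimpleAlign object.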

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq => $hsp->qs,
        -id => $query1->display_id,
        -start => 1,
        -end => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq => $hsp->ss,
        -id => $report->sbjctName,
        -start => 1,
        -end => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1; # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
# + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot => 'swiss',
                       sptrnrdb => 'swiss',
                       gb => 'genbank',
                       embl => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;

Page 27: Bio Perl

Chapter 2 Sequences

-MAKE => 'custom');

Bio::Tools::Sigcleave [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html] (tutorial [http://bioperl.org/Core/POD/bptutorial.html#III_3_4_Identifying_amino_acid_c])

$sigcleave_object = new Bio::Tools::Sigcleave ('-file' => 'sigtest.aa', # not a Bio::Seq object
                                               '-threshold' => '3.5',
                                               '-desc' => 'test sigcleave protein seq',
                                               '-type' => 'AMINO');

%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;


Chapter 3. Alignments

3.1 AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Therefore, format conversions can be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $inform = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::AlignIO->newFh ( -fh => \*STDIN, -format => $inform );
my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;


AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2 SimpleAlign

First, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();


SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
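The column-wise technique used above (transpose rows into columns, filter, transpose back) does not depend on bioperl. The following is a minimal sketch on plain strings, using core Perl only; the function name strip_gap_columns and the sample rows are made up for illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character from a set of
# equal-length rows (hypothetical helper, not a bioperl method).
sub strip_gap_columns {
    my ($gap, @rows) = @_;

    # rows -> columns
    my @cols;
    for my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ for split //, $row;
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gap) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    for my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ for split //, $col;
    }
    return @out;
}

my @stripped = strip_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCG-');
print "$_\n" for @stripped;
```

With these three rows, only the first and fourth columns are gap-free, so each row is reduced to two characters.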

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9


Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics


Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast (when _READMETHOD is set to Blast) actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram


Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14


Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;
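The "mark as done" idiom from the hint can be sketched in core Perl as follows. This is not the course solution: the list of (database, hit name) pairs is invented sample data standing in for the hits of a report, and it is assumed to be already sorted with the best hit first.

```perl
use strict;
use warnings;

# Hypothetical (database, hit name) pairs, best hit first; in the exercise
# these would come from the hits of a Blast report.
my @hits = (
    [ sp  => 'TAUD_ECOLI' ],
    [ pir => 'S12345'     ],
    [ sp  => 'YFDX_ECOLI' ],
    [ gb  => 'AAC73471'   ],
    [ pir => 'T99999'     ],
);

# Return the first (hence best) hit seen for each database.
sub best_by_database {
    my @sorted_hits = @_;
    my (%done, @best);
    for my $hit (@sorted_hits) {
        my ($database, $name) = @$hit;
        next if $done{$database};    # this database is already done
        $done{$database} = 1;
        push @best, "$database:$name";
    }
    return @best;
}

my @best = best_by_database(@hits);
print "$_\n" for @best;
```

Because the hash is only consulted for existence, the order of the input list alone decides which hit is kept for each database.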

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20


Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature of the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24


Figure 4.2. BPLite Classes diagram


4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );

    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\t\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\t\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.
Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram


4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?


Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27


Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29


Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);   # Bio::Seq.pm objects

  which is equivalent, with a variable, to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
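The kinds of references listed above can be combined in one small, self-contained program. This is only a sketch in core Perl: run_filter, my_filter and the -scores key are hypothetical names chosen to mirror the -filt_func and -save_array parameters, and do not belong to any bioperl class.

```perl
use strict;
use warnings;

# Hypothetical filter: keep scores of at least 50.
sub my_filter { my ($score) = @_; return $score >= 50 }

# Takes a reference to a parameter hash; fills the array referenced by
# -save_array (an output parameter) with the values accepted by -filt_func.
sub run_filter {
    my ($param) = @_;
    my $filter = $param->{-filt_func};
    for my $score (@{ $param->{-scores} }) {
        push @{ $param->{-save_array} }, $score if $filter->($score);
    }
    return scalar @{ $param->{-save_array} };
}

my @kept;
my %param = (
    -scores     => [ 12, 87, 50, 3 ],   # reference to an anonymous array
    -filt_func  => \&my_filter,         # reference to a function
    -save_array => \@kept,              # reference to a named array
);

my $count = run_filter(\%param);        # reference to a hash
print "kept $count: @kept\n";
print "types: ", join(' ', map { ref } \%param, \@kept, \&my_filter), "\n";
```

The caller's @kept array is modified through the reference, which is what makes -save_array usable as an output parameter.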

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write to the file

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
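As an illustration of passing filehandles around, here is a short sketch in core Perl. The write_line subroutine is a made-up helper; the in-memory filehandle (opening a reference to a scalar) is a standard Perl feature used here only so that the example needs no file on disk.

```perl
use strict;
use warnings;

# Write one line to whichever filehandle is passed in (hypothetical helper).
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle has to be passed as a reference to its typeglob.
write_line(\*STDOUT, "to standard output");

# A lexical filehandle is an ordinary scalar and can be passed directly.
# Here it is opened on an in-memory string instead of a file.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($mem, "to a string");
close($mem);

print "buffer holds: $buffer";
```

The subroutine does not need to know which kind of handle it received, which is also why bioperl's -fh parameter accepts \*STDOUT.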

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.


• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std.
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
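The same die/eval pattern can be tried without bioperl. In this minimal sketch, set_sequence is a hypothetical routine that dies on bad input in the way a bioperl method would call throw; the accepted character set is an arbitrary choice for the example.

```perl
use strict;
use warnings;

# Hypothetical routine that raises an exception on bad input.
sub set_sequence {
    my ($string) = @_;
    die "MSG: sequence [$string] does not look healthy\n"
        unless $string =~ /^[A-Za-z*-]+$/;
    return uc $string;
}

for my $candidate ('acgt', '==') {
    # eval returns the value of the block, or undef if it died;
    # the error message is then available in $@
    my $seq = eval { set_sequence($candidate) };
    if ($@) {
        print "Caught exception: $@";
    } else {
        print "no exception: $seq\n";
    }
}
```

The program keeps running after the failed call, which is the point of wrapping the risky instruction in eval.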

• In bioperl, exceptions are also used to implement interface classes. The following example, from Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
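A reduced version of this abstract-method idiom, with invented classes (Shape, Square, Blob) instead of the bioperl interfaces, could look like the sketch below. It uses a plain die where bioperl uses throw or confess.

```perl
use strict;
use warnings;

package Shape;            # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub area {                # "abstract" method: must be overridden
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide area()\n";
}

package Square;           # implements the abstract method
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;             # forgets to implement it
our @ISA = ('Shape');

package main;

for my $shape (Square->new(side => 3), Blob->new) {
    my $area = eval { $shape->area };
    print $@ ? "error: $@" : "area: $area\n";
}
```

The error only appears when the missing method is actually called, not when the class is loaded; this is the price of encoding interfaces at run time.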


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else { $format = "Genbank"; }
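To experiment with option parsing without typing command lines, the call to getopts can be wrapped in a subroutine that works on a local copy of @ARGV. This is only a sketch: parse_options and the returned hash layout are illustrative choices, while getopts itself is the documented Getopt::Std function.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a list of arguments the way the converter scripts do
# (hypothetical wrapper around getopts).
sub parse_options {
    local @ARGV = @_;               # getopts works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts)
        or die "usage: [-h] [-i informat] [-o outformat] files\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        files   => [@ARGV],         # what remains after the options
    };
}

my $conf = parse_options(qw(-i embl seqs.dat));
print "$conf->{inform} -> $conf->{outform}, files: @{ $conf->{files} }\n";
```

In the option string, a letter followed by a colon takes a value (-i embl), whereas a bare letter such as h is a simple flag.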

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
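The is-a tests and ref() can be tried on two toy classes. Feature and Exon below are invented for this sketch and are unrelated to the bioperl feature classes.

```perl
use strict;
use warnings;

package Feature;                       # hypothetical base class
sub new { my ($class, %args) = @_; return bless {%args}, $class }

package Exon;                          # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new(start => 10, end => 42);

print "class of exon: ", ref($exon), "\n";
print "Exon is-a Feature (class test)\n"  if Exon->isa('Feature');
print "exon is-a Feature (object test)\n" if $exon->isa('Feature');
print "a Feature is not an Exon\n"        unless Feature->new->isa('Exon');
```

Note that ref() gives the exact class the object was blessed into, whereas isa() also answers true for every ancestor class.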

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses it to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
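The moment at which a BEGIN block runs can be observed with a very small core Perl program; the @order variable is only there to record the sequence of events.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';                 # executed when the program runs

BEGIN { push @order, 'compile time' }    # executed as soon as it is parsed

print join(' then ', @order), "\n";
```

Although the BEGIN block comes later in the file, it is executed first, while the file is still being compiled; this is why a module can use it to set up defaults before any of its methods is called.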

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj (here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods).
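The three Exporter variables can be seen at work in a single file. In this sketch, My::SeqUtils is a made-up module defined inline in a BEGIN block, and the explicit import call stands in for the use statement that a separate .pm file would allow.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtils;              # hypothetical module, defined inline
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(revcom);                  # exported by default
    our @EXPORT_OK   = qw(gc_count);                # exported on demand
    our %EXPORT_TAGS = ( stats => [qw(gc_count)] ); # a named set of symbols

    # reverse complement of a DNA string
    sub revcom   { my $s = reverse shift; $s =~ tr/ACGTacgt/TGCAtgca/; return $s }
    # number of G and C characters
    sub gc_count { my $s = shift; return ($s =~ tr/GCgc//) }
}

# With a separate My/SeqUtils.pm file this would be written:
#   use My::SeqUtils qw(:DEFAULT :stats);
BEGIN { My::SeqUtils->import(qw(:DEFAULT :stats)) }

print revcom('AACG'), " ", gc_count('AACG'), "\n";
```

Asking for :stats alone would import gc_count but not revcom; the :DEFAULT tag is what brings in the @EXPORT list when other symbols are requested explicitly.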

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
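The tied-filehandle mechanism can be reproduced on a toy class. My::RecordIO below is a hypothetical stand-in for a SeqIO-like class that hands out strings instead of sequence objects; only tie, Symbol::gensym and the TIEHANDLE, READLINE and PRINT hooks are standard Perl.

```perl
use strict;
use warnings;
use Symbol ();

package My::RecordIO;     # hypothetical stand-in for a SeqIO-like class

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my $self = shift; return shift @{ $self->{in} } }
sub write_rec { my $self = shift; push @{ $self->{out} }, @_; return 1 }

# Return a filehandle tied to this object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

sub TIEHANDLE { my ($class, $obj) = @_; return $obj }   # reuse the object itself
sub READLINE  {
    my $self = shift;
    return $self->next_rec unless wantarray;            # scalar context: one record
    my @all;
    while (defined(my $rec = $self->next_rec)) { push @all, $rec }
    return @all;                                        # list context: all the rest
}
sub PRINT { my $self = shift; return $self->write_rec(@_) }

package main;

my $io = My::RecordIO->new('seq1', 'seq2', 'seq3');
my $fh = $io->fh;

my $first = <$fh>;              # calls READLINE in scalar context
print $fh $first;               # calls PRINT
my @rest = <$fh>;               # calls READLINE in list context

print "written: @{ $io->{out} }; rest: @rest\n";
```

This explains why a statement such as print $out $_ while <$in> works with objects: both operators are redirected to methods of the tied class.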


Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file


    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # no output name given: replace the extension of the input name
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0; # false
        }
        return 1; # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # (almost the same for GenBank/EMBL entries)
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end
    # of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
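The two measurements made in this solution (longest leading gap run, longest trailing gap run) can be checked on plain strings. The sketch below uses core Perl only; trim_gap_ends and the sample rows are illustrative, and substr plays the role of the slice method.

```perl
use strict;
use warnings;

# Cut the leading and trailing blocks in which at least one row
# starts or ends with a run of gaps (hypothetical helper).
sub trim_gap_ends {
    my ($gap, @rows) = @_;
    my ($start, $end) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap\E*)/;
        my ($trail) = $row =~ /(\Q$gap\E*)$/;
        $start = length($lead)  if length($lead)  > $start;
        $end   = length($trail) if length($trail) > $end;
    }
    return map { substr($_, $start, length($_) - $start - $end) } @rows;
}

my @trimmed = trim_gap_ends('-', '--ACG-TA', 'TTACGGT-', '-TAC-GTA');
print "$_\n" for @trimmed;
```

Internal gaps are kept, as in the exercise; only the flanking columns are removed. The sketch assumes rows of equal length containing at least one non-gap character.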

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# threshold: only hits at least this long are displayed (default 200)
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

# defaults: start at position 1, no upper limit (-1)
my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
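The position test in this solution treats -1 as "no upper limit". As a hedged illustration, the same predicate can be isolated and tried on plain numbers; the function name in_window is hypothetical and not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical predicate: true when [start, end] begins at or after
# $min_start and ends at or before $max_end; $max_end == -1 disables
# the upper bound.
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    return ($start >= $min_start && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

# Made-up HSP coordinates.
foreach my $hsp ([10, 80], [120, 300], [5, 40]) {
    my ($start, $end) = @$hsp;
    printf "%d-%d: bounded=%d unbounded=%d\n",
        $start, $end,
        in_window($start, $end, 10, 100),
        in_window($start, $end, 10, -1);
}
```

Using a sentinel keeps the command line simple (the third argument can be omitted), at the cost of reserving -1 as a value that can never be a real end position.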

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the database name: the word before the first '|' in the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

# databases for which the best hit has already been displayed
my %done;

while (my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
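The "mark a database as done" idea can be checked on plain strings before plugging it into a report parser. This is only a sketch: the hit names below are invented, and best_hit_per_db is a hypothetical helper; it relies on hits being listed from best to worst, as in a Blast report.

```perl
use strict;
use warnings;

# Extract the database tag that precedes the first '|' in a hit name.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first (best-ranked) hit name seen for each database.
sub best_hit_per_db {
    my @hit_names = @_;
    my (%done, @best);
    foreach my $name (@hit_names) {
        my $db = dbname($name);
        next unless defined $db;   # name without a database tag
        next if $done{$db}++;      # this database was already reported
        push @best, $name;
    }
    return @best;
}

# Invented hit names, ordered from best to worst.
my @hits = ('sp|P37610|TAUD_ECOLI', 'gb|AAC73471.1|', 'sp|Q9X0A1|EXAMPLE', 'pir||A64765');
print "$_\n" for best_hit_per_db(@hits);
```

The post-increment in `$done{$db}++` tests and marks in one step: it is false the first time a database is seen and true afterwards.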

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# submit both queries at once: one result per query
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only the best hit
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();

# the contig is the first sequence of the alignment
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    my $i = 0;
    while (my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
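The padding done by fill_gaps can be verified in isolation. The sketch below is an equivalent formulation using the repetition operator instead of loops (a rewrite for illustration, not the course's code); the test sequence and positions are made up.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at 1-based column $start of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # fragment reaches past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
print length($padded), "\n";
```

Every padded fragment then has the same width as the contig, which is what SimpleAlign needs to display the rows as one flush alignment.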

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the best hit of each database
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

# hits seen in the previous iterations
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
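The comparison between two consecutive iterations is a set difference in both directions. A minimal sketch with hash lookups instead of grep and a regular expression follows; this is a swapped-in technique, not what the solution above does, and diff_rounds and the hit names are hypothetical.

```perl
use strict;
use warnings;

# Compare two lists of hit names (array references).
# Returns (names only in $current, names only in $previous).
sub diff_rounds {
    my ($previous, $current) = @_;
    my %in_prev = map { $_ => 1 } @$previous;
    my %in_cur  = map { $_ => 1 } @$current;
    my @new  = grep { !$in_prev{$_} } @$current;
    my @gone = grep { !$in_cur{$_} }  @$previous;
    return (\@new, \@gone);
}

# Made-up hit names for three iterations.
my @rounds = ([qw(A B C)], [qw(A C D)], [qw(C D E)]);
my $previous = [];
for my $i (0 .. $#rounds) {
    my ($new, $gone) = diff_rounds($previous, $rounds[$i]);
    printf "iteration %d: new=%s disappeared=%s\n",
        $i + 1, join(',', @$new), join(',', @$gone);
    $previous = $rounds[$i];
}
```

Here each iteration is compared only with the one immediately before it, so a hit that disappeared once is not reported again in later iterations.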


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# one alignment per HSP
while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#   + 1 feature for the whole CDS
#   + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {
        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (!$nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                   -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan, +/- delta at each side
    my $seqstr;
    if ($start > $end) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                      -id      => $ids[1],
                                      -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
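The coordinate bookkeeping (a window of delta bases around the gene, with exon positions re-expressed relative to that window) can be worked through on plain numbers. The sketch below is one possible formulation with 1-based inclusive coordinates; it is not the arithmetic used in the solution above, and window_around and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, 1-based inclusive coordinates throughout.
# @exons is a list of [start, end] pairs on the genomic sequence.
sub window_around {
    my ($seq_length, $delta, @exons) = @_;
    my ($first) = sort { $a <=> $b } map { $_->[0] } @exons;
    my ($last)  = sort { $b <=> $a } map { $_->[1] } @exons;

    # extend by delta on each side, clamped to the sequence
    my $win_start = $first - $delta;
    $win_start = 1 if $win_start < 1;
    my $win_end = $last + $delta;
    $win_end = $seq_length if $win_end > $seq_length;

    # exon coordinates relative to the window
    my @relative = map { [ $_->[0] - $win_start + 1, $_->[1] - $win_start + 1 ] } @exons;
    return ($win_start, $win_end, \@relative);
}

# Made-up example: a 5000 bp sequence with two exons.
my ($ws, $we, $rel) = window_around(5000, 100, [250, 400], [900, 1100]);
print "window: $ws..$we\n";
print "exon: $_->[0]..$_->[1]\n" for @$rel;
```

With these values the window is 150..1200 and the exons become 101..251 and 751..951 inside it.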


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with their entry format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # an entry is given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (!$id) { $self->throw("no database specified"); }

    if (!_has_local_DB($db)) {
        $self->throw("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                             -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # close the pipe and check the exit status of golden
    $in->DESTROY();
    if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;
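The argument checks made by get_Seq can be exercised without golden or bioperl. The following sketch isolates them in a standalone function; parse_entry and %FORMAT_FOR are hypothetical names, and die is used here where the module calls $self->throw.

```perl
use strict;
use warnings;

# Same database-to-format table as in the module above.
my %FORMAT_FOR = (sprot => 'swiss', sptrnrdb => 'swiss',
                  gb    => 'genbank', embl   => 'embl');

# Split "database:identifier" and look up the entry format.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $FORMAT_FOR{$db};
    return ($db, $id, $FORMAT_FOR{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";

# A missing database prefix is reported as an error.
my $ok = eval { parse_entry('TAUD_ECOLI'); 1 };
print "error: $@" unless $ok;
```

Validating the entry string before opening the pipe avoids starting golden with arguments that cannot succeed.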


Chapter 3. Alignments

3.1. AlignIO

The AlignIO class is designed in the same way as the SeqIO class. Format conversions can therefore be done as follows:

Example 3.1. Format conversions with AlignIO

(data [data/infile.aln])

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

# input and output formats, with defaults
my $inform  = shift @ARGV || 'clustalw';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::AlignIO->newFh(-fh => \*STDIN,  -format => $inform);
my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => $outform);

print $out $_ while <$in>;

AlignIO class structure

Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

First, let's look at some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $aln = $in->next_aln();

print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

print "alignment length: ", $aln->length(), "\n";
printf "identity: %.2f %%\n", $aln->percentage_identity();
printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. To filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all
# columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
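The transpose, filter, transpose-back pattern of this example does not depend on bioperl and can be tried on plain strings. The following is a minimal sketch under the assumption that all rows have the same length; drop_gap_columns is a hypothetical name and the rows are made up.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character.
sub drop_gap_columns {
    my ($gap_char, @rows) = @_;

    # rows -> columns
    my @cols;
    foreach my $row (@rows) {
        my $i = 0;
        $cols[$i++] .= $_ for split //, $row;
    }

    # keep the columns without any gap
    my @kept = grep { index($_, $gap_char) < 0 } @cols;

    # columns -> rows
    my @out = ('') x @rows;
    foreach my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ for split //, $col;
    }
    return @out;
}

print "$_\n" for drop_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCGT');
```

In this input, columns 2 and 3 each contain a gap in one row, so every output row keeps only columns 1, 4 and 5.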

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast (when _READMETHOD is set to Blast) actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is at least a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Run a local Blast against each database given on the command line, and record the best hit from each database as a feature of the query sequence.
Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is still needed to blast two sequences and to run a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query = <$query_in>;

my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);    # number of iterations
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while (my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'psiblast',
                            -file   => $ARGV[0]);

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# one alignment per HSP
while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding: ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4 Genscan classes structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1 Database classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, but they are not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ],      # Bio::Seq.pm objects
                );

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs,        # Bio::Seq.pm objects
                );

with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
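The following plain-perl sketch puts the reference kinds listed above together in one small script: a hash reference and a code reference as input parameters, and an array reference as output parameter. The function select_names, the filter starts_with_p and the accession-like names are invented for the illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only names starting with "P".
sub starts_with_p {
    my ($name) = @_;
    return $name =~ /^P/;
}

# Takes a hash reference and a code reference as input,
# and fills the array behind $result (output parameter).
sub select_names {
    my ($params, $filter, $result) = @_;
    foreach my $name (@{ $params->{-names} }) {
        push @$result, $name if $filter->($name);
    }
    return scalar @$result;
}

my %params = ( -names => [ 'P12345', 'Q99999', 'P67890' ] );   # anonymous array
my @kept;
my $count = select_names(\%params, \&starts_with_p, \@kept);

print "kept $count: @kept\n";
print ref(\%params), " ", ref(\&starts_with_p), " ", ref(\@kept), "\n";
```

The last line prints the reference types (HASH, CODE, ARRAY), which is what ref() reports for unblessed references.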

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;          # read a line
    print OUTFILE $line;       # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '>f');
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
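As a small illustration of passing a filehandle as a parameter, the sketch below writes through a bareword filehandle passed as a typeglob reference, and then through a lexical filehandle. The helper write_record is hypothetical, and the output goes to an in-memory file (a scalar) so that the script needs no real file.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it receives.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

my $buffer = '';

# Bareword filehandle: it must be passed as a reference to its typeglob.
open(OUT, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record(\*OUT, 'first');
close(OUT);

# Lexical filehandle: already a scalar, so it can be passed directly.
open(my $lex, '>>', \$buffer) or die "cannot open in-memory file: $!";
write_record($lex, 'second');
close($lex);

print $buffer;
```

Both calls end up in the same buffer; the only difference is how the handle reaches the function.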

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n"
                . "MSG: " . $string . "\n" . $std
                . "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if ( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
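The two ideas of this section, catching an exception with eval and using die to encode an abstract method, can be tried without bioperl. In the sketch below, the classes Shape, Square and Blob and the function describe are hypothetical names chosen for the illustration.

```perl
use strict;
use warnings;

package Shape;     # hypothetical interface class
sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
# "Abstract" method: subclasses are expected to override it.
sub area {
    my ($self) = @_;
    die ref($self) . " did not provide an area method\n";
}

package Square;    # implements the interface
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;      # forgets to implement area
our @ISA = ('Shape');

package main;

# Call area inside eval and report either the value or the caught message.
sub describe {
    my ($shape) = @_;
    my $area = eval { $shape->area };
    if ($@) {
        chomp(my $message = $@);
        return "caught: $message";
    }
    return "area: $area";
}

print describe(Square->new(side => 3)), "\n";
print describe(Blob->new), "\n";
```

The second call does not stop the script: the die raised in the inherited area method is trapped by eval and its message is available in $@.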

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
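To experiment with getopts without typing a command line, the option list can be put into a localized @ARGV. The sketch below does this in a hypothetical helper parse_options; the option letters follow the examples above, and the default values are the same assumptions.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse a list of arguments as if it were the command line.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;            # getopts always reads @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    my %conf = (
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
        rest    => [ @ARGV ],       # what getopts left unprocessed
    );
    return \%conf;
}

my $conf = parse_options('-o', 'GCG', 'seqs.sp');
print "$conf->{inform} -> $conf->{outform} (@{ $conf->{rest} })\n";
```

A letter followed by a colon in the specification takes a value; a letter alone is a simple flag. Arguments that are not options stay in @ARGV.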

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
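The is-a test and ref() can be checked on a two-class hierarchy in plain perl. The classes Feature and Exon below are hypothetical stand-ins, not the bioperl classes.

```perl
use strict;
use warnings;

package Feature;            # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package Exon;               # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new;
print "class of exon: ", ref($exon), "\n";
print "Exon is-a Feature: ",   (Exon->isa('Feature') ? 'yes' : 'no'), "\n";
print "object is-a Feature: ", ($exon->isa('Feature') ? 'yes' : 'no'), "\n";
print "Feature is-a Exon: ",   (Feature->isa('Exon') ? 'yes' : 'no'), "\n";
```

isa works both on a class name and on an object, and follows @ISA upwards only: the last test answers no.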

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
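The order of execution can be observed with a few lines of plain perl. In this sketch the global array @trace is just a hypothetical log: the BEGIN block is written in the middle of the script but runs at compile time, before any ordinary statement.

```perl
use strict;
use warnings;

our @trace;     # hypothetical log of what ran, in order

push @trace, 'run time: first statement';

BEGIN { push @trace, 'compile time: BEGIN block' }

push @trace, 'run time: last statement';

print "$_\n" for @trace;
```

The BEGIN entry comes first in the output, followed by the two run-time statements in source order.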

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
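The three Exporter variables can be tried in a single file. In the sketch below the module My::Tools and its functions are hypothetical; the module is defined inline in a BEGIN block and registered in %INC so that the use statement does not look for a file on disk.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;                  # hypothetical module, defined inline
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('hello');                   # exported by default
    our @EXPORT_OK   = ('add', 'mul');              # imported on demand
    our %EXPORT_TAGS = ( math => ['add', 'mul'] );  # a named set of symbols

    sub hello { return 'hello' }
    sub add   { return $_[0] + $_[1] }
    sub mul   { return $_[0] * $_[1] }

    $INC{'My/Tools.pm'} = __FILE__;     # pretend the module file was loaded
}

use My::Tools qw(:math);                # imports add and mul only

print add(2, 3), "\n";
print mul(2, 3), "\n";
print "hello imported: ", (main->can('hello') ? 'yes' : 'no'), "\n";
```

Giving an explicit import list (here the :math tag) replaces the default list, so hello is not imported although it is in @EXPORT.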

6.3.2 Compiler instructions

• use strict; no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type); a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in (Bio::SeqIO):

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
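The same mechanism can be reproduced in a few lines of plain perl. The class Queue::Handle below is a hypothetical stand-in: reading from the tied handle takes items from a list, and printing to it stores what was written instead of sending it to a file.

```perl
use strict;
use warnings;
use Symbol ();

package Queue::Handle;      # hypothetical class tied to a filehandle

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [ @items ], out => [] }, $class;
}

# Called by <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    my @all = @{ $self->{items} };
    @{ $self->{items} } = ();
    return @all;
}

# Called by print $fh ...: remember what was written.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'Queue::Handle', 'seq1', 'seq2';

my $first = <$fh>;               # calls READLINE
print $fh "written:$first";      # calls PRINT

print "read: $first\n";
print "stored: @{ $obj->{out} }\n";
```

tie returns the underlying object, which is how the script inspects what PRINT has stored.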

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # no output name given: replace the extension of the input name
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() to avoid a warning on undefined options
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = Bio::AlignIO->new ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = Bio::AlignIO->new ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        # use the alignment's gap character here as well, not a literal "-"
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
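The gap-trimming logic of this solution can be tested on plain strings, without bioperl. In the sketch below the function gap_margins and the three aligned rows are made up for the illustration; the function returns the widest run of leading gaps and the widest run of trailing gaps over all rows.

```perl
use strict;
use warnings;

# Hypothetical helper: widest leading and trailing gap runs over all rows.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($start, $end) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $start = length($lead)  if length($lead)  > $start;
        $end   = length($trail) if length($trail) > $end;
    }
    return ($start, $end);
}

# Made-up alignment rows of equal length.
my @rows = ('--ACGT-A--', '-TACGTTA--', '---CGT-AC-');
my ($start, $end) = gap_margins('-', @rows);

print "margins: $start $end\n";
print substr($_, $start, length($_) - $start - $end), "\n" for @rows;
```

Internal gaps are kept: only the columns where at least one row still has a leading or trailing gap are removed.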

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else:
    #
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast" );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix from a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;     # databanks already displayed
    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
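The "first hit per databank" selection relies on a hash of already seen keys, which can be tried on a plain list. In the sketch below the hit names are invented, and dbname assumes names of the form databank|accession|identifier; hits are supposed to arrive best first, as in a Blast report.

```perl
use strict;
use warnings;

# Made-up hit names, assumed to be sorted best first.
my @hits = (
    'sp|P12345|A_HUMAN',
    'sp|P67890|B_MOUSE',
    'pir|S12345|S12345',
    'sp|Q11111|C_RAT',
    'pir|T54321|T54321',
);

# Databank prefix of a hit name, or 'unknown' if the name has no prefix.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

my %done;
my @best;
for my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;       # a hit of this databank was already kept
    push @best, $hit;
}

print "$_\n" for @best;
```

Only the first hit of each databank is printed, here one sp entry and one pir entry.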

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;     # only the first hit of each result
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
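The padding done by fill_gaps can also be written with the perl repetition operator x instead of two loops. The sketch below is a plain-perl variant under the same conventions ($start is 1-based, $length is the width of the alignment row); the guard for a sequence that runs past the end of the row is an addition of this sketch.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of a row that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;          # sequence runs past the end of the row
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACG', 3, 8), "\n";
print fill_gaps('ACGT', 1, 4), "\n";
```

The first call yields two leading and three trailing gaps around ACG; the second sequence already fills its row and is returned unchanged.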

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);
    my @previous_hitarray = ();     # hits seen in the previous iterations

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report  = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length,
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length,
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = Bio::Location::Split->new();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            # the entry starts delta bases before the first exon
            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates to the entry subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    # golden database name => bioperl sequence format
    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # an entry is written database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    # a module file must return a true value
    1;

Page 29: Bio Perl

Chapter 3. Alignments

AlignIO class structure. Warning: some alignment formats are available only as input formats.

Figure 3.1. AlignIO Classes diagram

3.2. SimpleAlign

At first, let's see some basic methods of the SimpleAlign class.

Example 3.2. Basic methods of SimpleAlign

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in = Bio::AlignIO->new ( -file => $ARGV[0], -format => 'clustalw' );
    my $aln = $in->next_aln();

    print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n";

    print "alignment length: ", $aln->length(), "\n";
    printf "identity: %.2f %%\n", $aln->percentage_identity();
    printf "identity of conserved columns: %.2f %%\n", $aln->overall_percentage_identity();

SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = Bio::AlignIO->new   ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
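
The column-wise logic of Example 3.3 can be tried without any alignment file. The following sketch is plain Perl with no bioperl dependency; the helper name remove_gap_columns and the sample strings are assumptions made for illustration and are not part of the course material.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column that contains the gap character.
# Takes the gap character and a list of equal-length aligned strings.
sub remove_gap_columns {
    my ($gapchar, @rows) = @_;

    # transpose the rows into column strings
    my @cols;
    foreach my $row (@rows) {
        my $colnr = 0;
        foreach my $chr (split //, $row) {
            $cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # keep only the columns without any gap
    my @kept = grep { index($_, $gapchar) < 0 } @cols;

    # transpose the kept columns back into rows
    my @new_rows = ('') x @rows;
    foreach my $col (@kept) {
        my $rownr = 0;
        foreach my $chr (split //, $col) {
            $new_rows[$rownr] .= $chr;
            $rownr++;
        }
    }
    return @new_rows;
}

my @filtered = remove_gap_columns('-', 'AC-GT', 'A-CGT', 'ACCG-');
print join("\n", @filtered), "\n";
```

Only the first and fourth columns are free of gaps here, so each of the three rows is reduced to two characters.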

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

    • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

    • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

    • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]
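
The idea of aligning DNA from a protein alignment can be sketched in plain Perl. This is not the code of the protal2dna script: the helper name thread_dna is hypothetical, and the sketch assumes that each DNA string is exactly the coding sequence of its protein (three bases per residue, no stop codon, no frame shift).

```perl
use strict;
use warnings;

# Hypothetical helper: thread an unaligned coding DNA string onto a gapped
# protein string. Each residue consumes one codon, each gap becomes three gaps.
sub thread_dna {
    my ($aligned_prot, $dna, $gap) = @_;
    my $pos = 0;
    my $aligned_dna = '';
    foreach my $aa (split //, $aligned_prot) {
        if ($aa eq $gap) {
            $aligned_dna .= $gap x 3;
        } else {
            $aligned_dna .= substr($dna, $pos, 3);
            $pos += 3;
        }
    }
    return $aligned_dna;
}

# two made-up aligned proteins with their coding sequences
print thread_dna('M-KV', 'ATGAAAGTT', '-'), "\n";
print thread_dna('MR-V', 'ATGCGTGTT', '-'), "\n";
```

Both output lines have the same length, three times the length of the protein alignment, so they can be loaded as a DNA alignment.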


Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";

        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = Bio::SearchIO->new ('-format' => 'blast',
                                           '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
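
The "mark as done" hint can be tried on a plain list of names. In this sketch the dbname function is only a hypothetical stand-in for the course's helper (it takes the text before the first '|'), and the hit names are made up; a real script would take them from $hit->name().

```perl
use strict;
use warnings;

# Hypothetical stand-in for the course's dbname() helper: take the text
# before the first '|' of an NCBI-style hit name as the database name.
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Made-up hit names, listed best hit first as in a Blast report.
my @hit_names = ('sp|P00001|AAA_ECOLI', 'gb|X00002|', 'sp|P00003|BBB_ECOLI',
                 'pir|S00004|', 'gb|X00005|');

my %done;   # databases already reported
my @best;   # first (best) hit seen for each database
foreach my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};
    $done{$database} = 1;
    push @best, $name;
}
print "$_\n" foreach @best;
```

Because hits are listed best first, the first hit met for a database is its best one, and the hash lookup skips all later hits from the same database.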

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);    # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            # loop on all the HSPs of the hit
            while (my $hsp = $hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
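
The comparison between two successive iterations is a set difference, which can be sketched with perl hashes alone. The hit names below are made up, and the sketch does not use the BPpsilite methods; it only illustrates the bookkeeping.

```perl
use strict;
use warnings;

# Made-up hit names for three iterations of a search.
my @rounds = (
    [ 'HIT_A', 'HIT_B' ],
    [ 'HIT_A', 'HIT_C' ],
    [ 'HIT_C', 'HIT_D' ],
);

my %previous;    # hit names seen at the previous iteration
my @log;
foreach my $i (0 .. $#rounds) {
    my %current = map { $_ => 1 } @{ $rounds[$i] };

    # present now but not before / present before but not now
    my @new  = grep { !$previous{$_} } sort keys %current;
    my @gone = grep { !$current{$_} }  sort keys %previous;

    push @log, sprintf("iteration %d new: %s gone: %s",
                       $i + 1, join(',', @new) || '-', join(',', @gone) || '-');

    # keep only this iteration for the next comparison
    %previous = %current;
}
print "$_\n" foreach @log;
```

Replacing %previous at the end of each pass, rather than adding to it, keeps a hit that disappeared once from being reported again at every later iteration.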

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding: ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to the following, with a named array variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
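
The three uses of references listed above (hash parameter, function parameter, output parameter) can be combined in one small plain-Perl sketch. The function run_with and the sample parameters are hypothetical and only mimic the shape of the bioperl calls.

```perl
use strict;
use warnings;

# Hypothetical function: receives a hash reference, a code reference and
# an array reference used as an output parameter.
sub run_with {
    my ($params, $filter, $results) = @_;
    foreach my $name (sort keys %$params) {
        push @$results, "$name=$params->{$name}" if $filter->($name);
    }
    return scalar @$results;
}

# filter function, passed by reference below
sub my_filter { return $_[0] ne 'skip' }

my %runParam = (prog => 'blastp', database => 'swissprot', skip => 1);
my @kept;
my $count = run_with(\%runParam, \&my_filter, \@kept);

print "kept $count: @kept\n";
print ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@kept), "\n";
```

The caller's @kept array is filled by the function because only a reference to it was passed, not a copy.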

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;           # read a line
    print OUTFILE $line;        # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                              # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> g');
    print $out $sequence;                           # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
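
Passing a typeglob reference can be tried in plain Perl. In this sketch the helper write_record is hypothetical, and an in-memory file stands in for a real output file so that nothing is written to disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes to whatever filehandle it is given.
sub write_record {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# An in-memory file stands in for a real output file.
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record(\*OUTFILE, 'first record');    # bareword handle: pass \*NAME
close(OUTFILE);

# the same helper works with the standard output stream
write_record(\*STDOUT, $buffer eq "first record\n" ? 'buffer ok' : 'buffer differs');
```

The helper never needs to know the name of the filehandle: it only receives a scalar holding the reference, which print accepts in the filehandle position.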

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
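
Both ideas, catching an exception with eval and using die to encode an abstract method, can be tried without bioperl. The classes Shape, Square and Blob in this sketch are hypothetical, and a plain die replaces the bioperl throw method.

```perl
use strict;
use warnings;

package Shape;      # hypothetical interface-like base class
sub new  { my ($class) = @_; return bless {}, $class; }
sub area { die "Shape: implementing class did not provide area\n"; }

package Square;     # implements the abstract method
our @ISA = ('Shape');
sub new  { my ($class, $side) = @_; return bless { side => $side }, $class; }
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;       # forgets to implement area
our @ISA = ('Shape');

package main;

my @messages;
foreach my $shape (Square->new(3), Blob->new()) {
    # the eval block is an expression: its statement ends with a ;
    my $area = eval { $shape->area() };
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "area: $area\n";
    }
}
print @messages;
```

The Blob object falls back on the base class method, which dies; the eval block turns that into a message in $@ instead of stopping the script.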

54

Chapter 6 Perl Reminders

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
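To see what getopts leaves behind, here is a small self-contained sketch. The command line is simulated by assigning to @ARGV (an assumption made only so the example runs on its own), and the option letters i, o and v are chosen for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, for the demo only; a real script would use
# the @ARGV it receives from the shell.
@ARGV = ('-i', 'embl', '-v', 'seqs.embl');

my %opts;
# 'i:' and 'o:' take a value, 'v' is a boolean switch
getopts('i:o:v', \%opts)
    or die "usage: $0 [-i informat] [-o outformat] [-v] file\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $verbose = $opts{v} ? 1 : 0;
my @files   = @ARGV;          # getopts removes the options it has parsed

print "in=$inform out=$outform verbose=$verbose files=@files\n";
```

Passing a hash reference, as in the first example above, avoids the global $opt_x variables used by the second form.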

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
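These tests work on any Perl class. The sketch below uses two hypothetical classes, Feature and Exon (not the bioperl ones), to show isa called on a class name and on an object, and ref used to recover the class of an object.

```perl
use strict;
use warnings;

package Feature;   # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package Exon;      # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new;

my $class_is_a  = Exon->isa('Feature')   ? 1 : 0;   # test on the class name
my $object_is_a = $exon->isa('Feature')  ? 1 : 0;   # test on an instance
my $not_a       = $exon->isa('Bio::Seq') ? 1 : 0;   # unrelated class name
my $class_name  = ref($exon);

print "class of exon: $class_name\n";
```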

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which uses it to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
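The timing of a BEGIN block can be observed directly. In this sketch the variable names are illustrative only; the block is written after the first push, yet it runs first because BEGIN blocks execute as soon as they are compiled.

```perl
use strict;
use warnings;

our @order;             # package variables, visible inside the BEGIN block
our $default_policy;

push @order, 'main code starts';

BEGIN {
    # Runs at compile time, before any ordinary statement of the file
    $default_policy = 'widest';
    push @order, 'BEGIN block';
}

push @order, 'main code continues';

print join(' -> ', @order), " (policy: $default_policy)\n";
```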

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
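A compact way to experiment with the three export lists is to define a module inline. My::SeqUtil below is a hypothetical module made up for this sketch; because it lives in the same file, its import method is called explicitly where a separate file would simply be loaded with use My::SeqUtil qw(:stats);.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtil;     # hypothetical module, defined inline for the demo
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(seq_length);                    # exported by default
    our @EXPORT_OK   = qw(gc_count reverse_seq);          # imported on demand
    our %EXPORT_TAGS = (stats => [qw(seq_length gc_count)]);

    sub seq_length  { return length $_[0] }
    sub gc_count    { return scalar(() = $_[0] =~ /[GC]/gi) }
    sub reverse_seq { return scalar reverse $_[0] }
}

# Stands in for "use My::SeqUtil qw(:stats);"
BEGIN { My::SeqUtil->import(':stats') }

my $length      = seq_length('ACGTGC');
my $gc          = gc_count('ACGTGC');
my $has_reverse = defined &main::reverse_seq ? 1 : 0;   # not in the :stats tag

print "length=$length gc=$gc reverse imported=$has_reverse\n";
```

Only the symbols named by the tag end up in the caller's namespace; reverse_seq stays reachable as My::SeqUtil::reverse_seq.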

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
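Both pragmas can be exercised in a few lines. This is only a sketch: the hash contents are invented, and the two failures are provoked on purpose inside eval so that the script keeps running and the error messages can be inspected.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);     # pre-declared package globals

%valid_type = (dna => 1, rna => 1, protein => 1);
my $known = exists $valid_type{dna} ? 1 : 0;

# An undeclared variable is rejected when the snippet is compiled.
my $compiled     = eval q{ my $total = $undeclared + 1; 1 };
my $strict_error = $@;

# A symbolic reference (a variable name held in a string) is refused at run time.
my $name         = 'valid_type';
my $symref_ok    = eval { my %copy = %$name; 1 };
my $symref_error = $@;

print "declared global usable: $known\n";
print "undeclared variable: $strict_error";
print "symbolic reference: $symref_error";
```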

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
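The mechanism is easier to follow on a toy class. In the sketch below, RecordStream and RecordStream::Fh are hypothetical classes written for this illustration (they only mimic the shape of the Bio::SeqIO code above): records are plain strings, next_record and write_record stand in for next_seq and write_seq.

```perl
use strict;
use warnings;
use Symbol ();

package RecordStream;          # hypothetical stream of records
sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_record  { my $self = shift; return shift @{ $self->{in} } }
sub write_record { my ($self, @recs) = @_; push @{ $self->{out} }, @recs; return 1 }
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();                  # anonymous glob
    tie *$s, 'RecordStream::Fh', $self;        # calls TIEHANDLE
    return $s;
}

package RecordStream::Fh;      # class the filehandle is tied to
sub TIEHANDLE { my ($class, $stream) = @_; return bless { stream => $stream }, $class }
sub READLINE {
    my $self = shift;
    return $self->{stream}->next_record() unless wantarray;
    my @list;
    while (defined(my $rec = $self->{stream}->next_record())) { push @list, $rec }
    return @list;
}
sub PRINT { my $self = shift; return $self->{stream}->write_record(@_) }

package main;

my $stream = RecordStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;

my $first = <$fh>;          # scalar context: one record (READLINE)
my @rest  = <$fh>;          # list context: all remaining records
print {$fh} 'seq4';         # PRINT stores the record in the stream

print "first=$first rest=@rest written=@{ $stream->{out} }\n";
```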

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;
    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;

    # no output file given: derive its name from the input file and the output format
    if ( ! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
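The gap-counting step can be checked on plain strings, without bioperl. The sketch below assumes a toy alignment of three made-up rows with '-' as the gap character, and uses substr where the solution uses the slice method of the alignment object.

```perl
use strict;
use warnings;

# Toy alignment (made-up sequences): one string per row, all the same length
my @rows = ('--ACGT-A--',
            '-TACGTTA--',
            '---CGTTAC-');

my ($startgaps, $endgaps) = (0, 0);
foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(-*)/;     # run of gaps at the beginning
    my ($trail) = $row =~ /(-*)$/;     # run of gaps at the end
    $startgaps = length $lead  if length $lead  > $startgaps;
    $endgaps   = length $trail if length $trail > $endgaps;
}

# keep only the columns between the longest leading and trailing gap runs
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the first row, are kept; only the ragged ends are removed.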

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else, use these lines instead of the two above:
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                    '-data' => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( ! ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # the databank is the word before the first '|' of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done = ();
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
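The "first hit seen per databank" logic does not depend on the Blast parser and can be tried on a plain list. The hit names below are invented placeholders in a databank|accession|entry layout, assumed to be sorted best hit first as in a Blast report.

```perl
use strict;
use warnings;

# Invented hit names, best hit first (placeholders, not real entries)
my @hit_names = ('sp|A0001|HIT_ONE',
                 'tr|B0001|HIT_TWO',
                 'sp|A0002|HIT_THREE',
                 'pdb|1ABC|A',
                 'tr|B0002|HIT_FOUR');

# the databank is the word before the first '|'
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

my (%done, @best);
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db}++;       # a better hit from this databank was already kept
    push @best, $name;
}

print "$_\n" for @best;
```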

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
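The padding done by fill_gaps can be checked in isolation. The variant below is a sketch that produces the same string with the repetition operator x instead of two loops; the input values are arbitrary and chosen only for the demonstration.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $right = $length - ($start - 1) - length($seq);
    $right = 0 if $right < 0;                  # sequence runs past the end
    return ('-' x ($start - 1)) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
```

Here the four residues occupy columns 3 to 6 of a ten-column row, so two gaps are added on the left and four on the right.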

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # databases known to golden, with the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # $entry has the form db:id
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module must return a true value


Chapter 3 Alignments


SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

    #! /local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
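The column round trip (rows to columns, filter, columns back to rows) can be followed on plain strings. This sketch assumes a toy alignment of three made-up rows and '-' as the gap character; it mirrors the three steps of the example without any bioperl object.

```perl
use strict;
use warnings;

# Toy alignment (made-up sequences), one string per row
my @rows = ('AC-GT',
            'A-CGT',
            'ACCG-');

# rows -> columns: column i collects the i-th character of every row
my @cols;
foreach my $row (@rows) {
    my $colnr = 0;
    $cols[$colnr++] .= $_ for split //, $row;
}

# 1. keep only the columns without any gap
my @no_gap_cols = grep { index($_, '-') < 0 } @cols;

# 2. columns -> rows: rebuild one string per sequence
my @filtered = ('') x @rows;
foreach my $col (@no_gap_cols) {
    my $rownr = 0;
    $filtered[$rownr++] .= $_ for split //, $col;
}

print "$_\n" for @filtered;
```

Only the first and fourth columns are gap free, so every row is reduced to two residues.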

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [httpbiowebseqanalinterfacesprotal2dnahtml]

Chapter 4. Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 41 StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

    • print its name and database

    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

        • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = Bio::SearchIO->new ('-format' => 'blast',
                                           '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than or equal to a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
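The hash hint of Exercise 4.11 can be sketched on plain strings, without any Blast report. This is only an illustration of the "mark as done" idiom: the dbname function below is a hypothetical stand-in that assumes NCBI-style hit names of the form db|accession|name (it is not the course's dbname.pl), and the hit names are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: take the database tag from an NCBI-style hit name
# such as "sp|P11111|AAA_ECOLI" (assumed format, not the course's dbname.pl).
sub dbname {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Invented hit names, in report order (best hit first).
my @hit_names = (
    'sp|P11111|AAA_ECOLI',
    'pir||S22222',
    'sp|P33333|BBB_ECOLI',
    'gb|AAA44444.1|',
    'pir||S55555',
);

my %done;
my @best;
foreach my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};      # a better hit was already kept for this database
    $done{$database} = 1;
    push @best, $name;
}
print "$_\n" foreach @best;
```

Because a Blast report lists hits from best to worst, the first hit seen for a database is its best one, which is why a simple "seen" hash is enough.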

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print each alignment in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);                      # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            # nextHSP returns one HSP per call, so loop until it is exhausted
            while( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.
Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);   # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);     # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
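The three kinds of reference parameters listed above (hash, function, output array) can be exercised together in plain Perl. This is a minimal sketch: the run function, the -scores key and the threshold in my_filter are invented for the illustration and do not correspond to any bioperl method.

```perl
use strict;
use warnings;

# A filter passed by reference, in the spirit of -filt_func above.
sub my_filter { my ($score) = @_; return $score >= 50; }

# Takes a hash reference (input), a code reference (input) and
# an array reference (output parameter filled by the function).
sub run {
    my ($params, $filter, $results) = @_;
    foreach my $score (@{ $params->{-scores} }) {
        push @$results, $score if $filter->($score);
    }
    return scalar @$results;
}

my %runParam = ( -prog => 'blastp', -scores => [ 10, 60, 95 ] );
my @kept;
my $count = run(\%runParam, \&my_filter, \@kept);

print ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@kept), "\n";
print "kept $count: @kept\n";
```

Passing \@kept lets the function fill the caller's array directly, which is what an output parameter such as -save_array relies on.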

6.2.2 Filehandles and streams

• A data stream is for instance an open file. Streams are available through filehandles (file descriptors) - in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '>f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
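As a small illustration of passing a bareword filehandle through a typeglob reference, here is a sketch in core Perl. It assumes nothing from bioperl: write_line is a made-up helper, and the handle is opened on an in-memory buffer so that the example leaves no file behind.

```perl
use strict;
use warnings;

# Write one line to whatever filehandle reference we are given.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle opened on an in-memory buffer (core Perl feature).
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";

write_line(\*OUTFILE, 'first line');    # reference to the typeglob
my $alias = \*OUTFILE;                  # the reference can also be stored in a scalar
write_line($alias, 'second line');
close(OUTFILE);

print $buffer;
```

The same \*OUTFILE reference is what a -fh parameter expects when a bareword filehandle has to be handed to a method.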

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
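Both ideas above (catching an exception with eval, and using an exception to encode an abstract method) can be tried in a few lines of core Perl. The classes Shape, Square and Blob are invented for this sketch; bioperl itself raises its exceptions through throw rather than a bare die.

```perl
use strict;
use warnings;

package Shape;   # plays the role of an interface class
sub new  { my ($class, %args) = @_; return bless { %args }, $class; }
sub area {
    my ($self) = @_;
    # "abstract" method: a subclass is expected to override it
    die ref($self) . " did not provide an area method\n";
}

package Square;
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;    # forgets to implement area
our @ISA = ('Shape');

package main;

my @messages;
foreach my $shape (Square->new(side => 3), Blob->new()) {
    my $area = eval { $shape->area() };     # note the ; after the eval block
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "area: $area\n";
    }
}
print @messages;
```

The error only appears when the missing method is actually called, which is the main limitation of this way of encoding abstract methods.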

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');       # -f takes a value
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
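To see what getopts leaves in the hash and in @ARGV, the following sketch parses an explicit argument list instead of the real command line. The wrapper parse_args and the sample arguments are made up for the illustration; the option string 'i:o:h' declares -i and -o as taking a value and -h as a flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse options from an explicit argument list instead of the real command line.
sub parse_args {
    local @ARGV = @_;
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    return (\%opts, [@ARGV]);     # parsed options, then the remaining arguments
}

my ($opts, $rest) = parse_args('-i', 'swiss', '-h', 'seqs.sp');
my $inform  = $opts->{i} || 'swiss';
my $outform = $opts->{o} || 'fasta';
print "in=$inform out=$outform help=", ($opts->{h} ? 1 : 0), " files=@$rest\n";
```

Using the hash form of getopts avoids the global $opt_x variables, which is convenient under use strict.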

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
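The is-a and ref tests can be checked on a two-class hierarchy in core Perl. The Feature and Exon packages below are invented for this sketch and only mimic the shape of the bioperl classes.

```perl
use strict;
use warnings;

package Feature;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub primary_tag { return $_[0]->{tag}; }

package Exon;
our @ISA = ('Feature');         # Exon inherits from Feature
sub is_coding { return 1; }

package main;

my $exon = Exon->new(tag => 'Exon');

my @facts = (
    ref($exon),                             # class of the object
    ($exon->isa('Feature') ? 1 : 0),        # is-a test on an object
    (Exon->isa('Feature')  ? 1 : 0),        # is-a test on a class
    ($exon->can('is_coding') ? 1 : 0),      # method available on the subclass
    (Feature->can('is_coding') ? 1 : 0),    # but not on the parent class
);
print "@facts\n";
```

Note that new is inherited: Exon->new runs the constructor of Feature but blesses the object into Exon, which is why ref returns the subclass name.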

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
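The timing of a BEGIN block is easy to observe in core Perl. In this sketch the variable names are arbitrary, and the string 'widest' merely stands in for the policy object created in the bioperl example.

```perl
use strict;
use warnings;

our @order;                 # package variables, visible at compile time
our $coord_policy;

push @order, 'run time';    # executed only when the program runs

BEGIN {
    # Executed as soon as this block has been compiled,
    # hence before the push statement above.
    $coord_policy = 'widest';
    push @order, 'BEGIN';
}

print join(' then ', @order), " (policy: $coord_policy)\n";
```

Although the BEGIN block appears after the push statement in the file, it runs first, because it is executed during compilation.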

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
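The three Exporter variables can be tried in a single file. This is a sketch under stated assumptions: My::Tools is a hypothetical module defined inline, so BEGIN blocks are used to reproduce what a use My::Tools qw(:DEFAULT :obj) statement would do for a module stored in its own file.

```perl
use strict;
use warnings;

BEGIN {
    package My::Tools;              # hypothetical module, defined inline
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(hello);                    # exported by default
    our @EXPORT_OK   = qw(goodbye $Default);         # exported on demand
    our %EXPORT_TAGS = ( obj => [ qw($Default) ] );  # named set of symbols
    our $Default     = 'static object';
    sub hello   { return 'hello' }
    sub goodbye { return 'goodbye' }
}

# For a module in its own file this would be: use My::Tools qw(:DEFAULT :obj);
BEGIN { My::Tools->import(qw(:DEFAULT :obj)); }

print hello(), " / $Default\n";
print "goodbye imported? ", (main->can('goodbye') ? 'yes' : 'no'), "\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is declared in both places.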

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
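The effect of these two pragmas can be observed directly. In the sketch below, the content of %valid_type is invented; the point is only that a symbolic reference is refused at run time under use strict, whereas a real reference to a variable pre-declared with use vars is fine.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);   # pre-declared package (global) variables

%valid_type = map { $_ => 1 } qw(dna rna protein);

my $name = 'valid_type';
# Under "use strict" a symbolic reference is a run-time error ...
my $count = eval { scalar keys %$name };
my $refused = $@ ? 1 : 0;

# ... whereas a real reference is always allowed.
my $ref = \%valid_type;
print "symbolic reference refused: $refused, types: ",
      join(' ', sort keys %$ref), "\n";
```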

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
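The tied-filehandle mechanism can be reproduced without bioperl. In the following sketch, ItemStream and ItemStream::Handle are hypothetical classes standing in for a sequence stream and its handle class: reading from the handle calls READLINE and printing to it calls PRINT, as described above.

```perl
use strict;
use warnings;
use Symbol ();

package ItemStream;   # hypothetical stand-in for a sequence stream
sub new {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}
sub next_item  { my $self = shift; return shift @{ $self->{items} }; }
sub write_item { my $self = shift; push @{ $self->{written} }, @_; return 1; }

# Same pattern as the fh method: return a filehandle tied to a handle class.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, 'ItemStream::Handle', $self;
    return $s;
}

package ItemStream::Handle;
sub TIEHANDLE { my ($class, $stream) = @_; return bless { stream => $stream }, $class; }
sub READLINE {
    my $self = shift;
    return $self->{stream}->next_item() unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_item());
    return @list;
}
sub PRINT { my $self = shift; return $self->{stream}->write_item(@_); }

package main;

my $stream = ItemStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;          # calls READLINE in scalar context
my @others = <$fh>;          # calls READLINE in list context
print $fh $first;            # calls PRINT
print "first=$first others=@others written=@{ $stream->{written} }\n";
```

The wantarray test is what lets the same READLINE serve both a single read and a "read everything" in list context.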

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # no output file given: replace the extension of the input file
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = (shift @ARGV) || 'swiss';
    my $outform = (shift @ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() to avoid a warning on an undefined option
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same as for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = Bio::AlignIO->new ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = Bio::AlignIO->new ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = Bio::AlignIO->newFh ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
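The trimming logic of this solution can be checked on plain strings. The sketch below is not the bioperl solution itself: trim_gap_ends is a made-up helper that replaces the slice call by substr, and the three rows are invented test data.

```perl
use strict;
use warnings;

# Cut the leading and trailing blocks in which at least one row has a gap.
sub trim_gap_ends {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    return map { substr($_, $startgaps, $width) } @rows;
}

my @trimmed = trim_gap_ends('-', '--ACGT-A', 'TTAC-TGA', 'TTACGT--');
print "$_\n" foreach @trimmed;
```

Only the flanking blocks are removed; the internal gap of the second row is kept, which is what distinguishes this exercise from the column-wise gap removal shown in Chapter 3.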

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(1.0);       # Expectation value (E)
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast"
                                                        );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # despite its name, $max_length is the minimum hit length to display
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(),
                  " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;   # -1 means "no upper limit"

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the database name (the word before the first '|') from a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;   # databases for which the best hit has already been displayed
    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                                  -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    # blastall accepts a reference to an array of sequence objects
    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(),
                  " significance: ", $hit->significance(), "\n";
            last;   # only the best hit of each query
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    # the contig is the first sequence of the alignment
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
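The padding done by fill_gaps can be checked on plain strings, without bioperl. The following is a minimal sketch of the same logic written with Perl's repetition operator; the sample string and coordinates are made up for illustration, and the clamp on the right-hand padding is an added assumption that the original loops do not need.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# and the result spans $length columns in total.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # assumption: never pad by a negative amount
    return ('-' x $left) . $seq . ('-' x $right);
}

# hypothetical HSP string placed at column 3 of a 10-column alignment
my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
```

Each HSP string padded this way has the same length as the contig, which is what allows it to be added to the same alignment.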

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;   # only the best hit of each database
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();   # hits seen at the previous iteration

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }

        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        # remember only the hits of this iteration for the next comparison
        @previous_hitarray = ();
        push (@previous_hitarray, @$oldhitarray_ref);
        push (@previous_hitarray, @$hitarray_ref);
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database
    #  - 1 entry per gene with genomic DNA
    #  + 1 feature for the whole CDS
    #  + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                                -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates into the extracted subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (!$nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their flat file format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (!$id) { $self->throw("no database specified"); }

        if (!_has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        # read the entry from a pipe opened on the golden program
        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (!defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;


SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods to be implemented by Alignment classes. The diagram also includes some classes that can create SimpleAlign objects.

Figure 3.2. Align Classes diagram

The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters gap columns:

Example 3.3. Filter gap columns

    #!/local/bin/perl -w

    use strict;
    use Bio::AlignIO;

    my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # if you need to work on the columns, create a list containing
    # all columns as strings

    my @aln_cols = ();

    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split("", $seq->seq()) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1. we create a list containing all the columns without any gap

    my $gapchar = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col ( @aln_cols ) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2. we modify the sequences in the alignment
    # 2a. we reconstruct the sequence strings

    my @seq_strs = ();
    foreach my $col ( @no_gap_cols ) {
        my $colnr = 0;
        foreach my $chr ( split "", $col ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # 2b. we replace the old sequence strings with the new ones

    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq(shift @seq_strs);
    }

    print $out $aln;
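The transposition at the heart of Example 3.3 (rows to columns, filter, columns back to rows) can be tried on plain strings, without bioperl. This is only a sketch: the three rows are made-up data and drop_gap_columns is a hypothetical helper name, not a bioperl method.

```perl
use strict;
use warnings;

# Remove every column that contains at least one gap character.
# All rows are assumed to have the same length.
sub drop_gap_columns {
    my ($gap, @seqs) = @_;

    # Transpose the rows into column strings.
    my @cols;
    foreach my $seq (@seqs) {
        my $i = 0;
        $cols[$i++] .= $_ foreach split //, $seq;
    }

    # Keep only the columns without any gap.
    my @kept = grep { index($_, $gap) < 0 } @cols;

    # Transpose the remaining columns back into rows.
    my @out = ('') x @seqs;
    foreach my $col (@kept) {
        my $i = 0;
        $out[$i++] .= $_ foreach split //, $col;
    }
    return @out;
}

# Toy alignment rows (hypothetical data).
my @rows = ('AC-GT', 'A-TGT', 'ACTG-');
print join("\n", drop_gap_columns('-', @rows)), "\n";
```

Only the first and fourth columns of the toy alignment are gap-free, so each output row keeps two characters.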

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):
  • print its name and database
  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):
    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";

        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, whose documentation lists the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12
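As a reminder of what ref() returns, here is a minimal sketch that does not need bioperl. My::Report is a stand-in class invented for this illustration; it only shows that ref() gives the class name for an object and the reference type for a plain reference.

```perl
use strict;
use warnings;

# A stand-in class (hypothetical, not part of bioperl).
package My::Report;
sub new { my ($class) = @_; return bless {}, $class; }

package main;

my $report = My::Report->new();
my @list   = (1, 2, 3);

print ref($report), "\n";   # class name of a blessed object
print ref(\@list), "\n";    # reference type of a plain array reference
print ref("text") eq '' ? "not a reference\n" : "reference\n";
```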

4.1.2.2. Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class, so you can create such an object directly from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than or equal to a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
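The two hints of Exercise 4.11 can be tried without a Blast report. In this minimal sketch the hit names are made up, and the dbname function is a simplified version that assumes the database prefix is the word before the first '|' at the start of the name.

```perl
use strict;
use warnings;

# Extract the database prefix from a hit name such as "sp|ACC001|PROT_A".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hypothetical hit names, listed best hit first as in a Blast report.
my @hit_names = (
    'sp|ACC001|PROT_A',
    'sp|ACC002|PROT_B',
    'pir||S00001',
    'gb|AAA00001.1|',
    'pir||S00002',
);

my %done;   # databases already seen
my @best;   # first (best) hit of each database
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next unless defined $db;
    next if $done{$db};    # a better hit from this database was already kept
    $done{$db} = 1;
    push @best, $name;
}
print "$_\n" foreach @best;
```

Because hits are sorted by significance in the report, the first hit met for a database is also its best one.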

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome
• You may first display this as a Clustalw alignment (one sequence per HSP)
• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature of the query sequence.
Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for all HSPs, the P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                     -format => 'fasta');

    my $query = <$query_in>;
    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);   # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'psiblast',
                                -file   => $ARGV[0]);

    while (my $result = $in->next_result()) {
        foreach my $hit ($result->hits) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration
• the hits that have disappeared

Solution A.25
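The comparison asked for in Exercise 4.20 is a set difference between the hit names of consecutive iterations. The following minimal sketch works on made-up lists of hit names instead of a parsed report; compare_iterations is a hypothetical helper, not a bioperl method.

```perl
use strict;
use warnings;

# Compare each iteration with the previous one.
# Takes one array reference of hit names per iteration and returns,
# for each iteration, a pair [ \@new_hits, \@disappeared_hits ].
sub compare_iterations {
    my @rounds = @_;
    my %previous;
    my @report;
    foreach my $round (@rounds) {
        my %current = map { $_ => 1 } @$round;
        my @new  = grep { !$previous{$_} } @$round;
        my @gone = grep { !$current{$_} } sort keys %previous;
        push @report, [ \@new, \@gone ];
        %previous = %current;
    }
    return @report;
}

# Hit names per iteration (made-up data).
my @iterations = (
    [qw(hitA hitB hitC)],
    [qw(hitA hitC hitD)],
    [qw(hitC hitD hitE)],
);

my $n = 0;
foreach my $pair (compare_iterations(@iterations)) {
    my ($new, $gone) = @$pair;
    printf "iteration %d: new = %s; disappeared = %s\n",
        ++$n, join(',', @$new) || '-', join(',', @$gone) || '-';
}
```

Keeping only the hits of the latest iteration in %previous means a hit is reported as disappeared once, at the iteration where it drops out.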

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.
Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal class structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?
2. Look at the _parse_predictions method in Bio::Tools::Genscan.
3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

# first argument: index file; remaining arguments: entry ids
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

# first argument: index file to create; remaining arguments: files to index
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name,
                                     -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format

  • containing an entry for each Genscan predicted CDS

  • with the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
my $out = Bio::SeqIO->newFh(-fh => \*STDOUT,
                            -format => 'fasta');

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                   -parse     => 1,
                                   -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

• references on an anonymous array:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a plain, named hash
%a = (a => 1, b => 2);
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
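The cases listed above can be tried out without bioperl. The following is a minimal sketch in core Perl only; run_search and my_filter are hypothetical names chosen for the illustration, not bioperl functions. It passes a hash reference, a function reference and an array reference used as an output parameter.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): receives a hash reference,
# a code reference and an array reference used as an output parameter.
sub run_search {
    my ($params, $filter, $results) = @_;
    foreach my $name (@{ $params->{-seqs} }) {
        # call the function through its reference
        push @$results, $name if $filter->($name);
    }
    return scalar @$results;
}

# Hypothetical filter: keeps names starting with "seq"
sub my_filter {
    my ($name) = @_;
    return $name =~ /^seq/;
}

my @seqs     = ('seq1', 'other', 'seq2');
my %runParam = (-prog => 'blastp', -seqs => \@seqs);
my @kept;    # filled in by run_search

my $count = run_search(\%runParam, \&my_filter, \@kept);
print "kept $count: @kept\n";
print "types: ", ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@kept), "\n";
```

Because the function receives references, it works on the caller's own data: @kept is modified in place rather than copied.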

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a sequence

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh(-file => 'f');      # make a tied filehandle
$sequence = <$in>;                          # read a sequence object
$out = Bio::SeqIO->newFh(-file => '> f');
print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
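As a small sketch of passing a bareword filehandle to a function through a typeglob reference (core Perl only; write_record is a hypothetical helper, and an in-memory file is used so that the example does not touch the disk):

```perl
use strict;
use warnings;

# Hypothetical helper: writes one record to whatever handle it receives.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} $text, "\n";
}

# An in-memory file keeps the sketch self-contained.
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record(\*OUTFILE, 'first record');    # reference to the typeglob
close(OUTFILE);

# Read it back through a bareword filehandle.
open(INFILE, '<', \$buffer) || die "cannot open in-memory file: $!";
my $line = <INFILE>;
close(INFILE);
chomp $line;

# The same helper works with the standard output stream.
write_record(\*STDOUT, "read back: $line");
```

The helper never needs to know whether it was given a file, an in-memory buffer or STDOUT, which is the reason bioperl constructors accept a -fh parameter.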

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ($@) {
    print "Caught exception\n";
} else {
    print "no exception\n";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ($self->can('throw')) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
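The abstract-method idiom and the eval/$@ pattern can be combined in a short, bioperl-free sketch. The class names My::SeqI, My::Seq and My::Broken are hypothetical, and throw here is a simplified stand-in for the bioperl method (no stack trace).

```perl
use strict;
use warnings;

{
    # Hypothetical interface class: seq() must be overridden by subclasses.
    package My::SeqI;
    sub new {
        my ($class, %args) = @_;
        my %self = %args;
        return bless \%self, $class;
    }
    sub throw {
        my ($self, $msg) = @_;
        die "EXCEPTION: $msg\n";
    }
    sub seq {
        my ($self) = @_;
        $self->throw(ref($self) . " did not implement seq");
    }
}
{
    # A subclass that implements the method...
    package My::Seq;
    our @ISA = ('My::SeqI');
    sub seq { my ($self) = @_; return $self->{seq}; }
}
{
    # ...and one that does not.
    package My::Broken;
    our @ISA = ('My::SeqI');
}

my @messages;
foreach my $obj (My::Seq->new(seq => 'acgt'), My::Broken->new()) {
    my $seq = eval { $obj->seq() };     # the block value is returned on success
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "ok: $seq\n";
    }
}
print @messages;
```

The call on the My::Broken object falls through to the inherited seq, which raises the exception; the surrounding eval turns it into an ordinary value in $@ instead of ending the program.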

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
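A runnable sketch of the first form follows. The command line is simulated by assigning @ARGV (the values are made up for the illustration); in the option string, a letter followed by a colon takes a value, and a letter alone is a boolean flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, so that the sketch runs without arguments.
@ARGV = ('-i', 'embl', '-O', 'result.fasta', 'extra.txt');

my %opts;
# 'i:' means that -i takes a value; 'h' alone is a boolean flag
getopts('i:o:I:O:h', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $infile  = $opts{I} || 'infile';
my $outfile = $opts{O} || 'outfile';

# getopts removes the options it recognised from @ARGV
print "$inform -> $outform, $infile -> $outfile, remaining: @ARGV\n";
```

Options that are not given keep their default value, and the arguments that are not options stay in @ARGV.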

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

if ($seq->isa("Bio::Seq::RichSeq")) ...

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
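These three mechanisms (@ISA, isa and ref) can be seen together in a minimal sketch with two hypothetical classes, My::Root and My::Seq, which stand in for the bioperl classes:

```perl
use strict;
use warnings;

{
    package My::Root;                 # hypothetical base class
    sub new {
        my ($class, %args) = @_;
        my %self = %args;
        return bless \%self, $class;
    }
}
{
    package My::Seq;                  # hypothetical subclass
    our @ISA = qw(My::Root);          # inheritance is declared through @ISA
    sub seq_length { my ($self) = @_; return length($self->{seq}); }
}

# new is inherited from My::Root
my $seq = My::Seq->new(seq => 'acgtacgt');

my $class   = ref($seq);                       # class of the object
my $obj_isa = $seq->isa('My::Root') ? 1 : 0;   # test on an object
my $cls_isa = My::Seq->isa('My::Root') ? 1 : 0; # test on a class name

print "class of seq: $class\n";
print "object is-a My::Root: $obj_isa, class is-a My::Root: $cls_isa\n";
print "length: ", $seq->seq_length(), "\n";
```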

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses a BEGIN block to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything else is encapsulated in methods.
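A short sketch shows the execution order. The variable names and the value 'widest' are made up for the illustration; the point is only that the BEGIN block runs at compile time, before any ordinary statement of the file.

```perl
use strict;
use warnings;

our @order;
our $default_policy;

# An ordinary statement: executed at run time.
push @order, 'main code';

# Although written after the statement above, this block runs first,
# at compile time, the way a module initialises its defaults when loaded.
BEGIN {
    $default_policy = 'widest';      # hypothetical default value
    push @order, 'BEGIN block';
}

print join(' -> ', @order), "\n";
print "policy: $default_policy\n";
```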

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
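The three export lists can be tried out in a single file. In this sketch, My::Tools is a hypothetical module defined inline, and the explicit import call stands in for the use statement that would load a module from disk.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the sketch is a single file.
BEGIN {
    package My::Tools;
    use Exporter ();
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(revcom);                 # exported by default
    our @EXPORT_OK   = qw(gc_count $Default);      # imported on demand
    our %EXPORT_TAGS = (obj => [qw($Default)]);    # named set of symbols
    our $Default     = 'acgt';

    # reverse complement of a lowercase DNA string
    sub revcom {
        my ($s) = @_;
        (my $r = reverse $s) =~ tr/acgt/tgca/;
        return $r;
    }
    # number of g and c characters
    sub gc_count {
        my ($s) = @_;
        return ($s =~ tr/gc//);
    }
}

# Equivalent to "use My::Tools qw(:obj revcom gc_count)" for a module on disk.
BEGIN { My::Tools->import(qw(:obj revcom gc_count)); }

print revcom('aacg'), " ", gc_count('aacg'), " $Default\n";
```

The :obj tag brings in $Default, just as qw(:obj) brings in $Blast in the example above; a symbol that appears in none of the lists cannot be imported at all.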

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    # scalar context: one sequence; list context: all remaining sequences
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;          # make a tied filehandle
$sequence = <$fh>;       # read a sequence object (calls READLINE)
print $fh $sequence;     # write a sequence object (calls PRINT)
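The same mechanism can be reproduced without bioperl. In the sketch below, My::ListFh is a hypothetical class whose tied filehandle reads items from, and appends items to, a plain array instead of a sequence file.

```perl
use strict;
use warnings;
use Symbol;

{
    # Hypothetical class tied to a filehandle, backed by an array.
    package My::ListFh;
    sub TIEHANDLE {
        my ($class, $list) = @_;
        return bless { list => $list }, $class;
    }
    sub READLINE {
        my $self = shift;
        # scalar context: one item
        return shift @{ $self->{list} } unless wantarray;
        # list context: all remaining items
        my @all = @{ $self->{list} };
        @{ $self->{list} } = ();
        return @all;
    }
    sub PRINT {
        my $self = shift;
        push @{ $self->{list} }, @_;
        return 1;
    }
}

my @store = ('seq1', 'seq2');
my $fh = gensym();                   # anonymous glob, as in Bio::SeqIO
tie *$fh, 'My::ListFh', \@store;

my $first = <$fh>;                   # calls READLINE (scalar context)
print $fh 'seq3';                    # calls PRINT
my @rest = <$fh>;                    # calls READLINE (list context)
print "first: $first, rest: @rest\n";
```

Using wantarray in READLINE is what lets a single tied handle support both reading one object at a time and reading everything at once.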


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # no output file given: derive it from the input file name
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh(-file => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh(-fh => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh(-file => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh(-fh => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;   # false
    }
    return 1;       # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "    -i informat  -- infile format (default 'swiss')\n";
    print "    -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "    -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

# collect the PDB cross-references
my @structures = ();
foreach my $link ($annotation->each_DBLink()) {
    if ($link->database() eq 'PDB') {
        push(@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ($annotation->get_Annotations('comment')) {
    my @lines = split(/\n/, $comment->text());
    foreach my $line (@lines) {
        if ($line =~ s/^-!- FUNCTION: //) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ($#ARGV > -1) {
    $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
} else {
    $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
}
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# set the expectation value threshold
$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

# keep the raw Blast output in a file
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while (my @rids = $factory->each_rid) {

    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid (@rids) {

        my $rc = $factory->retrieve_blast($rid);
        if (! ref($rc)) {
            if ($rc < 0) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while (my $hit = $result->next_hit) {
                print "hit name is: ", $hit->name, "\n";
                while (my $hsp = $hit->next_hsp) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-fh' => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# minimum hit length to report (second argument, default 200)
my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $min_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

# -1 means: no upper limit
my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# the databank name is the word before the first | of the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;    # databanks already displayed
while (my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0],
                              -format => 'fasta');
push(@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1],
                              -format => 'fasta');
push(@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only the first hit of each query
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];    # minimum fraction of identical positions

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];    # minimum fraction of identical positions

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq => $seq->seq,
                                        -id => $seq->display_id,
                                        -start => 1,
                                        -end => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file' => $ARGV[1]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    my $i = 0;
    while (my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq => $seq,
                                               -id => $name,
                                               -start => $start,
                                               -end => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

# first argument: query sequence file; remaining arguments: databases
my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only the first subject
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

my @previous_hitarray = ();    # hits of the previous iteration
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit (@$hitarray_ref) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push(@previous_hitarray, @$oldhitarray_ref);
    push(@previous_hitarray, @$hitarray_ref);
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0],
                                  -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1],
                                  -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# one alignment per HSP
while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq => $hsp->qs,
                                          -id => $query1->display_id,
                                          -start => 1,
                                          -end => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq => $hsp->ss,
                                          -id => $report->sbjctName,
                                          -start => 1,
                                          -end => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::Seq::RichSeq;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT,
                               -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new(-filename => $Index_File_Name,
                                -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT,
                            -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
# classes instantiated below
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift exon coordinates so that they are relative to the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation',
                                $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # database name => sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # $entry has the form database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) {
        $self->throw ("$id -- no such entry in database $db");
    }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;    # a module must return a true value

Page 32: Bio Perl

Chapter 3 Alignments

properties, you have to extract the columns by yourself, filter them and reconstruct the new sequences. The following example filters gap columns.

Example 3.3. Filter gap columns

#! /local/bin/perl -w

use strict;
use Bio::AlignIO;

my $in  = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# if you need to work on the columns, create a list containing all
# columns as strings

my @aln_cols = ();

foreach my $seq ( $aln->each_alphabetically() ) {
    my $colnr = 0;
    foreach my $chr ( split("", $seq->seq()) ) {
        $aln_cols[$colnr] .= $chr;
        $colnr++;
    }
}

# then do the work: we want to eliminate all the columns containing gaps
# 1. we create a list containing all the columns without any gap

my $gapchar = $aln->gap_char();
my @no_gap_cols = ();
foreach my $col ( @aln_cols ) {
    next if $col =~ /\Q$gapchar\E/;
    push @no_gap_cols, $col;
}

# 2. we modify the sequences in the alignment
# 2a. we reconstruct the sequence strings

my @seq_strs = ();
foreach my $col ( @no_gap_cols ) {
    my $colnr = 0;
    foreach my $chr ( split "", $col ) {
        $seq_strs[$colnr] .= $chr;
        $colnr++;
    }
}

# 2b. we replace the old sequence strings with the new ones

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;
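The column handling in this example does not depend on bioperl. The following is a minimal sketch of the same transposition and filtering on plain strings, using a made-up three-sequence alignment; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Toy alignment: three aligned sequences of equal length (made-up data).
my @rows    = ('AC-GT', 'A-CGT', 'ACCGT');
my $gapchar = '-';

# Build one string per column.
my @cols;
for my $row (@rows) {
    my $i = 0;
    $cols[$i++] .= $_ for split //, $row;
}

# Keep only the columns that contain no gap character.
my @kept = grep { index($_, $gapchar) < 0 } @cols;

# Transpose back: one string per sequence.
my @filtered = ('') x @rows;
for my $col (@kept) {
    my $i = 0;
    $filtered[$i++] .= $_ for split //, $col;
}

print "$_\n" for @filtered;    # prints AGT three times
```

Columns 2 and 3 each contain a gap in one sequence, so only the first, fourth and fifth columns survive.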

Exercise 3.1. Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]

• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3. Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)

• bioperl classes:

  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]

  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]

  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]

• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(),
          " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                     'database'  => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(),
          " significance: ", $hit->significance(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(),
              " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.

Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(),
              " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).

Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.

Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.

Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(),
                  " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->qs,
                       -id    => $query1->display_id,
                       -start => 1,
                       -end   => $query1->length
                   );
    my $sbjctSeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->ss,
                       -id    => $report->sbjctName,
                       -start => 1,
                       -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron: primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr: primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful to use modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " $a{a} $a{b}\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
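The three uses of references listed above (a hash passed by reference, a function passed by reference, an array used as an output parameter) can be combined in a short sketch that runs without bioperl. The run_filter helper and its data are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash reference, a code reference and an
# array reference that is filled in as an output parameter.
sub run_filter {
    my ($params, $filter, $results) = @_;
    for my $name (sort keys %$params) {
        push @$results, $name if $filter->($params->{$name});
    }
    return scalar @$results;
}

my %scores = (hitA => 12, hitB => 87, hitC => 45);
my @kept;
my $count = run_filter(\%scores, sub { $_[0] > 40 }, \@kept);

print "kept $count: @kept\n";                                    # kept 2: hitB hitC
print ref(\%scores), " ", ref(\@kept), " ", ref(sub {}), "\n";   # HASH ARRAY CODE
```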

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;              # read a line
print OUTFILE $sequence;       # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
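As a small illustration of the typeglob reference, here is a sketch in plain perl. The write_record function is a hypothetical helper, and the in-memory file is only used so that the example does not touch the disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a record to whatever filehandle it receives.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_record(\*STDOUT, "written through a typeglob reference");

# A lexical handle (here an in-memory file) can be passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, "written through a lexical handle");
close($mem);

print "buffer holds: $buffer";
```

Lexical filehandles (my $fh) are ordinary scalars, so they avoid the typeglob syntax altogether; the \*STDOUT form remains necessary for the predefined bareword handles.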

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
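The two ideas above (an abstract method that raises an exception, and a caller that catches it with eval) can be tried without bioperl. The following is a minimal sketch; the My::SeqI and My::BrokenSeq classes are hypothetical and use a plain die instead of the bioperl throw method.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() must be provided by subclasses.
package My::SeqI;
sub new { my $class = shift; return bless {}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide seq()\n";
}

# Hypothetical subclass that forgets to implement seq().
package My::BrokenSeq;
our @ISA = ('My::SeqI');

package main;

my $obj    = My::BrokenSeq->new();
my $caught = '';
eval {
    $obj->seq();          # raises the exception
    1;
} or do {
    $caught = $@;         # the message passed to die
};

print $caught ? "Caught exception: $caught" : "no exception\n";
```

Ending the eval block with a true value and testing its result, rather than testing $@ alone, is a common defensive idiom: it still detects the failure if $@ happens to be cleared before it is read.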

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
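To see what getopts leaves behind, the first example can be run on a simulated command line. This is only a sketch: the argument values are made up, and @ARGV is set by hand so that the script runs without arguments.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, so that the sketch runs without arguments.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.txt');

# A letter followed by ':' takes a value; a letter alone is a boolean flag.
my %opts;
getopts('i:o:I:O:', \%opts) or die "bad options\n";

my $infile  = $opts{I} || 'infile';
my $outfile = $opts{O} || 'outfile';
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';

# Options are removed from @ARGV; the remaining arguments stay in place.
print "in=$infile ($inform) out=$outfile ($outform) rest=@ARGV\n";
```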

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
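These tests can be tried on a pair of toy classes. In the sketch below, My::Feature and My::Exon are hypothetical classes that only mimic the inheritance pattern.

```perl
use strict;
use warnings;

package My::Feature;               # hypothetical base class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package My::Exon;                  # hypothetical subclass
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new(start => 10, end => 42);

print "class of exon: ", ref($exon), "\n";                 # class of exon: My::Exon
print "the object is-a My::Feature\n" if $exon->isa('My::Feature');
print "the class is-a My::Feature\n"  if My::Exon->isa('My::Feature');
```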

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
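A short sketch of the execution order, in plain perl: the BEGIN block runs while the file is being compiled, so it runs before statements that appear earlier in the source. The @order array is only there to record what happens.

```perl
use strict;
use warnings;

our @order;

# Appears first in the file, but only runs once compilation is over.
push @order, 'main code';

BEGIN {
    # Runs at compile time, as soon as this block has been parsed.
    push @order, 'BEGIN block';
}

print join(' -> ', @order), "\n";    # BEGIN block -> main code
```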

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
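The three Exporter variables can be seen at work in a single file. The My::Tools package below is hypothetical; because it is not stored in its own .pm file, its import method is called explicitly instead of through a use statement.

```perl
use strict;
use warnings;

package My::Tools;                       # hypothetical module
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = ('revcomp');          # exported by default
our @EXPORT_OK   = ('gc_count');         # exported on demand
our %EXPORT_TAGS = (all => ['revcomp', 'gc_count']);

sub revcomp  { my $s = reverse shift; $s =~ tr/ACGTacgt/TGCAtgca/; return $s; }
sub gc_count { my $s = shift; return ($s =~ tr/GCgc//); }

package main;

# Does what "use My::Tools qw(:all);" would do for a module in its own file.
My::Tools->import(':all');

print revcomp('AACG'), "\n";     # CGTT
print gc_count('AACG'), "\n";    # 2
```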

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in (Bio::SeqIO):

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;           # make a tied filehandle
$sequence = <$fh>;        # read a sequence object (calls READLINE)
print $fh $sequence;      # write a sequence object (calls PRINT)
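The same mechanism can be reproduced with a toy class. In this sketch, My::ListFh is a hypothetical tied-handle class that reads from and writes to in-memory lists instead of sequence files.

```perl
use strict;
use warnings;
use Symbol;

package My::ListFh;                     # hypothetical tied-handle class

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;   # scalar context: one item
    my @all = @{ $self->{in} };                       # list context: all items
    @{ $self->{in} } = ();
    return @all;
}

sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'My::ListFh', 'seq1', 'seq2';

my $first = <$fh>;            # calls READLINE
print $fh "written";          # calls PRINT
print "read: $first, stored: @{ $obj->{out} }\n";    # read: seq1, stored: written
```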

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. An universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile ) {
    # derive the output file name from the input file name
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta
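The option handling of this script can be checked on its own, without bioperl. The sketch below is only an illustration under one assumption: the command line is simulated by assigning @ARGV by hand, whereas in a real script the shell fills it.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (an assumption for this sketch)
@ARGV = ('-I', 'seqs.sp', '-o', 'gcg');

# A colon after a letter means that the option takes a value
my %opts;
getopts('i:o:I:O:', \%opts) or die "bad options\n";

# Options that were not given fall back to their defaults
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "$infile ($inform) -> $outfile ($outform)\n";
```

With the simulated arguments, only -I and -o are set, so the input format and the output file name keep their default values.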

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

# apply the default before lc() to avoid a warning on undefined options
my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from: ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    # use the alignment gap character here too, not a literal '-'
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
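The gap counting can be tried on plain strings, without bioperl. This is a minimal sketch under stated assumptions: gap_margins is a hypothetical helper written for this illustration, the three rows are made-up data, and substr stands in for the slice method of the alignment object.

```perl
use strict;
use warnings;

# Hypothetical helper: return the widest leading and trailing gap runs
# found in a list of aligned rows.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

# Made-up rows of equal length, as in an alignment
my @rows = ('--ACGTAC-', 'TTACG-ACG', '-TACGTA--');
my ($start, $end) = gap_margins('-', @rows);

# Keep only the columns between the two margins
my @trimmed = map { substr($_, $start, length($_) - $start - $end) } @rows;
print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the second row, are kept: only the leading and trailing blocks are cut.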

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (alternative to a: keep only one of the two)

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

$factory->e(1.0);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
                  '-prog'     => 'blastp',
                  '-data'     => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# minimum hit length to display (default 200)
my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $min_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the database name (text before the first '|') from a hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db = $1 if $db =~ /(\w+)\|/;
    return $db;
}

my %done;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
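The "first hit per database" logic relies only on a hash and on the order of the hits, so it can be sketched without a Blast report. In the sketch below the hit names are made up for illustration, and they are assumed to be sorted best first, as in a report.

```perl
use strict;
use warnings;

# Same idea as the dbname function of the solution:
# the database name is the text before the first '|'
sub dbname {
    my ($name) = @_;
    return $name =~ /(\w+)\|/ ? $1 : $name;
}

# Made-up hit names, assumed to be listed best first
my @hit_names = ('sp|X00001|AAAA_TEST', 'pir|Y00001',
                 'sp|X00002|BBBB_TEST', 'pir|Y00002');

my %done;   # databases already seen
my @best;   # first (best) hit of each database
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db}++;    # skip later hits of a database already seen
    push @best, $name;
}
print "$_\n" for @best;
```

The post-increment in the test marks the database as done in the same statement, which replaces the if/else of the solution.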

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;   # only the best hit of each query
    }
}


Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20 Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
                     -seq   => $seq->seq,
                     -id    => $seq->display_id,
                     -start => 1,
                     -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
                            -seq   => $seq,
                            -id    => $name,
                            -start => $start,
                            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
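The padding done by fill_gaps can also be written with the Perl repetition operator x. This is a sketch of an equivalent variant, not the course's own code; it assumes that $start is 1-based and that the sequence fits within $length columns.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of a row that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;    # sequence runs past the end: no trailing gaps
    return ('-' x $lead) . $seq . ('-' x $trail);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";    # prints --ACGT----
```

Each padded row then has the same length as the contig, which is what the alignment object needs before add_seq is called.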

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;   # only the best hit
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22 Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25 Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray;
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->qs,
                       -id    => $query1->display_id,
                       -start => 1,
                       -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(
                       -seq   => $hsp->ss,
                       -id    => $report->sbjctName,
                       -start => 1,
                       -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1; # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database
#   - 1 entry per gene with genomic DNA
#   + 1 feature for the whole CDS
#   + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }
        # shift exon coordinates so that they are relative to the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else                 { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases and the format of their entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # the entry is given as db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;   # a module must return a true value
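The entry parsing and the format lookup of this module can be exercised without golden or bioperl. The sketch below is an illustration only: resolve_entry is a hypothetical helper that does not exist in the module, the table is a reduced copy of %AVAIL_LOCALDB, and die stands in for the throw method.

```perl
use strict;
use warnings;

# Reduced copy of the lookup table of the module
my %AVAIL_LOCALDB = ( sprot => 'swiss', gb => 'genbank', embl => 'embl' );

# Hypothetical helper: split "db:id" and look up the entry format
sub resolve_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = resolve_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";

# eval traps the exception, as a caller of the module would do
my $ok = eval { resolve_entry('nodb:X'); 1 };
print "error: $@" unless $ok;
```

Checking the database name before starting the external program keeps the error message precise and avoids a useless call.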

Page 33: Bio Perl

Chapter 3 Alignments

foreach my $seq ( $aln->each_alphabetically() ) {
    $seq->seq(shift @seq_strs);
}

print $out $aln;

Exercise 3.1 Create an alignment without gaps

Create a new alignment without gaps at the beginning and the end of the alignment.

• sample alignment file [data/infile.aln]
• Hint: you don't have to extract all the columns as in the second example.

Solution A.6

3.3 Code reading: protal2dna

This script aligns DNA sequences given their protein alignment.

• code [lecture_code/protal2dna.html] (download)
• bioperl classes:
  • Bio::AlignIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/AlignIO.html]
  • Bio::SimpleAlign [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SimpleAlign.html]
  • Bio::Tools::CodonTable [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/CodonTable.html]
• Web form [http://bioweb/seqanal/interfaces/protal2dna.html]


Chapter 4 Analysis

4.1 Blast

4.1.1 Running Blast

4.1.1.1 Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1 StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1 Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).

Solution A.7

Exercise 4.2 Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).

Solution A.8

Exercise 4.3 Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).

Solution A.9


Exercise 4.4 Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.

Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html])
  • print its name and database
  • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html])
    • print some hit statistics


Example 4.2 StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";

    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5 Display Blast hits

Display the hit accession and the HSP length.

Solution A.11

Exercise 4.6 Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)

Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3 Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7 Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8 Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.

Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1 Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9 Filtering hits by length

Only display hits whose length is greater than or equal to a given value (create a blast report from this file [data/blast.out]).

Solution A.14


Exercise 4.10 Filtering hits by position

Only display hits between two positions.

Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.
• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).

Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP with identity above a given level.

Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.

Solution A.19

Exercise 4.15 Locate EST in a genome

Locate an EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.
• You may first display this as a Clustalw alignment (one sequence per HSP).
• Blast report [data/blast-est.out]

Solution A.20


Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line, and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

33

Chapter 4 Analysis

Exercise 417 Print informations from a BPLite report

Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22

Exercise 418 Create a BioToolsBPlite from a file

Create a BioToolsBPlite from a fileSolution A23

Exercise 419 Create a BioToolsBPlite from standard input

Create a BioToolsBPlite from standard inputSolution A24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration $iteration\n";
        my $result = $blast_report->round($iteration);
        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            # loop over every HSP of the hit
            while (my $hsp = $hit->nextHSP) {
                print "\t\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\t\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast', -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
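The comparison between two consecutive iterations comes down to a set difference on hit names. The following is a minimal sketch in plain Perl, assuming the hit names of each iteration have already been collected into arrays; the data and the only_in_first helper are hypothetical and not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical data: hit names per iteration, as they could be collected
# while reading a PSI-blast report.
my @rounds = ( [qw(P1 P2 P3)], [qw(P2 P3 P4)], [qw(P3 P4)] );

# Hypothetical helper: names present in the first list but absent
# from the second one.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$second;
    return grep { !$seen{$_} } @$first;
}

for my $i (1 .. $#rounds) {
    my @appeared    = only_in_first($rounds[$i],     $rounds[$i - 1]);
    my @disappeared = only_in_first($rounds[$i - 1], $rounds[$i]);
    print "iteration ", $i + 1, ": new [@appeared] lost [@disappeared]\n";
}
```

Using a hash as a lookup table keeps each comparison linear in the number of hits.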

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only; you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • with the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

  See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
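The three kinds of references listed above (hash, array and function) can be combined in a single call. The sketch below uses plain Perl only; run_filter and long_enough are hypothetical helpers written for this illustration, not bioperl functions.

```perl
use strict;
use warnings;

# Hypothetical helper: receives a hash reference, an array reference
# and a code reference, and returns a reference on an anonymous hash.
sub run_filter {
    my ($params, $items, $filter) = @_;
    my @kept = grep { $filter->($_) } @$items;
    return { prog => $params->{-prog}, kept => \@kept };
}

# Hypothetical filter function, passed by reference below.
sub long_enough { return $_[0] >= 100 }

my %run_param = (-prog => 'blastp', -database => 'swissprot');
my @lengths   = (120, 45, 300);

my $result = run_filter(\%run_param, \@lengths, \&long_enough);
print ref($result), " $result->{prog} @{ $result->{kept} }\n";
```

Passing references avoids flattening the hash and the array into a single argument list.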

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;              # read a line
    print OUTFILE $sequence;       # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
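As a small illustration of passing a filehandle as a parameter, the sketch below hands a typeglob reference and then a lexical filehandle to the same subroutine. The write_line helper is hypothetical, and the second handle is opened on an in-memory string so that the sketch does not touch the disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it is given.
sub write_line {
    my ($fh, $text) = @_;
    print $fh $text, "\n";
}

# A bareword filehandle has to be passed as a reference to its typeglob.
write_line(\*STDOUT, "to standard output");

# A lexical filehandle is already a scalar and can be passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) || die "cannot open in-memory file: $!";
write_line($out, "to a buffer");
close($out);
```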

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
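The abstract-method idiom and the eval-based catch can be tried without bioperl. In the sketch below, all class names (Shape, Square, Blob) and the try_area helper are hypothetical; the base class dies in the method that subclasses are expected to override, and the caller traps the error with eval.

```perl
use strict;
use warnings;

package Shape;          # hypothetical interface class
sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
# "Abstract" method: raises an exception unless a subclass overrides it.
sub area { die "Shape: area() not implemented by " . ref($_[0]) . "\n" }

package Square;         # implements the abstract method
our @ISA = ('Shape');
sub area { my $self = shift; return $self->{side} ** 2 }

package Blob;           # forgets to implement it
our @ISA = ('Shape');

package main;

# Hypothetical helper: calls area() and reports whether an exception occurred.
sub try_area {
    my ($shape) = @_;
    my $area = eval { $shape->area() };     # note the ; after the eval block
    return $@ ? "caught: $@" : "area: $area\n";
}

print try_area(Square->new(side => 3));
print try_area(Blob->new());
```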

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
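To experiment with option strings without typing a command line, getopts can be applied to a localized copy of @ARGV. This is a minimal sketch; the parse_formats helper and its option letters are assumptions chosen to mirror the first example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parses the given argument list instead of the real @ARGV.
# In the option string, a letter followed by ':' takes a value; 'h' is a flag.
sub parse_formats {
    local @ARGV = @_;
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    return {
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => $opts{h} ? 1 : 0,
        rest    => [@ARGV],           # arguments left after the options
    };
}

my $conf = parse_formats('-i', 'embl', 'seqs.dat');
print "$conf->{inform} -> $conf->{outform} (@{ $conf->{rest} })\n";
```

getopts removes the options it recognizes from @ARGV, so the remaining elements are the positional arguments.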

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
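The same three operations (inheriting through @ISA, testing with isa, and finding the class with ref) can be seen on a toy hierarchy. The Animal and Dog classes below are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

package Animal;                       # hypothetical base class
sub new   { my $class = shift; return bless {}, $class }
sub speak { return "..." }

package Dog;                          # hypothetical subclass
our @ISA = qw(Animal);                # inheritance is declared through @ISA
sub speak { return "Woof" }

package main;

my $pet = Dog->new();                 # new() is inherited from Animal
print "class of pet: ", ref($pet), "\n";
print "pet is an Animal\n"       if $pet->isa('Animal');
print "Dog class is an Animal\n" if Dog->isa('Animal');
print "pet says ", $pet->speak(), "\n";
```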

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
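The effect of a BEGIN block on execution order can be observed with a few lines of plain Perl. In this sketch the variable names are hypothetical; the block is written after the push statement but runs first, because BEGIN blocks are executed at compile time.

```perl
use strict;
use warnings;

our @order;
our $policy;                 # hypothetical package-level default

push @order, 'run time';     # executed when the program runs

# Executed as soon as it is compiled, before any run-time statement.
BEGIN {
    push @order, 'BEGIN';
    $policy = 'widest';
}

print "order: @order\n";
print "policy: $policy\n";
```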

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
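The three Exporter variables can be tried in a single file. In the sketch below, My::SeqUtils is a hypothetical module defined inline, and the explicit import call stands in for the use statement that would be written if the module were stored in its own file.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtils;              # hypothetical module, defined inline
    use Exporter ();
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(gc_count);                       # exported by default
    our @EXPORT_OK   = qw(at_count);                       # imported on demand
    our %EXPORT_TAGS = (all => [qw(gc_count at_count)]);   # a named set of symbols
    sub gc_count { my $s = shift; return ($s =~ tr/GCgc//) }
    sub at_count { my $s = shift; return ($s =~ tr/ATat//) }
}

# Stands in for "use My::SeqUtils qw(:all);" with a module stored in a file.
BEGIN { My::SeqUtils->import(':all') }

print gc_count('ATGCGC'), " ", at_count('ATGCGC'), "\n";
```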

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
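The mechanism can be reproduced in isolation. The following sketch ties a filehandle to a hypothetical QueueFh class (not part of bioperl) that serves items from a list on read and records whatever is printed to it.

```perl
use strict;
use warnings;
use Symbol ();

package QueueFh;                       # hypothetical class, not part of bioperl

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}
# Called by <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    return splice @{ $self->{items} };
}
# Called by print $fh ...
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'QueueFh', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT

print STDOUT "first: $first, rest: @rest, written: @{ $obj->{written} }\n";
```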

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    # hits that were not present at the previous iteration
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }

    # hits of the previous iteration that are not found any more
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    # remember the hits of this iteration only, for the next comparison
    @previous_hitarray = (@$oldhitarray_ref, @$hitarray_ref);
}


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases


Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (= join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the
    # sequence submitted to genscan +/- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else                 { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # an entry is given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;


  • Bioperl course
    • Chapter 1. General introduction
      • 1.1. Bioperl documentation
      • 1.2. General bioperl classes
    • Chapter 2. Sequences
      • 2.1. The Bio::SeqIO class
      • 2.2. Sequence classes
      • 2.3. Features and Location classes
      • 2.4. Sequence analysis tools
    • Chapter 3. Alignments
      • 3.1. AlignIO
      • 3.2. SimpleAlign
      • 3.3. Code reading: protal2dna
    • Chapter 4. Analysis
      • 4.1. Blast
      • 4.2. Genscan
    • Chapter 5. Databases
      • 5.1. Database classes
      • 5.2. Accessing a local database with golden
    • Chapter 6. Perl Reminders
      • 6.1. UML
      • 6.2. Perl reminders to use bioperl modules
      • 6.3. Perl reminders for a further advanced understanding of bioperl modules
    • Appendix A. Solutions
      • A.1. Sequences
      • A.2. Alignments
      • A.3. Analysis
      • A.4. Databases
Page 34: Bio Perl

Chapter 4. Analysis

4.1. Blast

4.1.1. Running Blast

4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast

Example 4.1. StandAloneBlast run

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

You can try it with this file [data/1prot.fasta].

Exercise 4.1. Running Blast on a Swissprot entry

Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 2.2.2).
Solution A.7

Exercise 4.2. Running Blast: Setting parameters

Change the Expectation value (E) Blast parameter to 1.0 (instead of the default value 10.0) (query [data/1prot.fasta]).
Solution A.8

Exercise 4.3. Running Blast: Saving output

Save the blast report in a local file (e.g. blast.out).
Solution A.9

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2. Parsing Blast

4.1.2.1. Parsing from a Blast run result

pseudocode:

• Run Blast
• Create a report
• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html])
    • print its name and database
    • for all HSP (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html])
        • print some hit statistics

Example 4.2. StandAloneBlast parsing

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                     'database'  => 'swissprot',
                                                     _READMETHOD => "Blast" );

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and hsp length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (hint: use the ref() perl function)
Solution A.12

4.1.2.2. Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast (in the examples above) actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3. Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
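
The hash hint of Exercise 4.11 can be sketched in plain Perl, without any Blast report. The hit names below are made up, they are assumed to be listed from best to worst hit, and the dbname function here is only a stand-in that takes the text before the first "|" (it is not the course's solution/dbname.pl).

```perl
use strict;
use warnings;

# Hypothetical hit names, best hit first, in an assumed
# "database|accession|name" layout.
my @hit_names = ('sp|P0A001|AAA_ECOLI', 'pir|S12345|S12345',
                 'sp|P0A002|BBB_ECOLI', 'gb|AAB00001|AAB00001',
                 'pir|S54321|S54321');

# Stand-in helper: the database name is the text before the first '|'.
sub dbname {
    my ($name) = @_;
    my ($db) = split /\|/, $name;
    return $db;
}

my %done;         # databases whose best hit has already been kept
my @best_hits;    # first (best) hit seen for each database
foreach my $name (@hit_names) {
    my $database = dbname($name);
    next if $done{$database};      # a better hit was already kept
    $done{$database} = 1;
    push @best_hits, $name;
}
print "$_\n" foreach @best_hits;
```

With a real report, the loop body would sit inside the usual next_hit loop and use $hit->name() instead of the made-up strings.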

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                   -format => 'fasta' );
my $query = <$query_in>;

my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);    # number of iterations
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
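
The comparison asked for in Exercise 4.20 does not depend on bioperl. A minimal sketch, assuming the hit names of each iteration are already available as plain lists (the names A to E below are made up), could look like this:

```perl
use strict;
use warnings;

# Hypothetical hit names found at three successive iterations
my @rounds = ( [qw(A B C)], [qw(A C D)], [qw(C D E)] );

my %previous;     # hits seen at the previous iteration
my @log;
my $iteration = 0;
foreach my $hits (@rounds) {
    $iteration++;
    my %current = map { $_ => 1 } @$hits;
    my @new  = grep { !$previous{$_} } @$hits;                # not seen before
    my @gone = grep { !$current{$_} } sort keys %previous;    # disappeared
    push @log, "iteration $iteration: new=@new gone=@gone";
    %previous = %current;     # keep this iteration only for the next comparison
}
print "$_\n" foreach @log;
```

With a Bio::Tools::BPpsilite report, the lists would come from the newhits and oldhits methods of each round, as in the course solution.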

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.
Solution A.26

4.1.6. Blast: Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure


Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format
    • containing an entry for each Genscan predicted CDS
    • put this CDS exons as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a plain hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
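
As a complement, here is a small self-contained sketch of a function that receives a hash reference and a function reference, and returns an array reference. The names run_with_filter and my_filter and the sample strings are hypothetical; only core Perl is used.

```perl
use strict;
use warnings;

# Receives a hash reference and a code reference, returns an array reference
sub run_with_filter {
    my ($params, $filter) = @_;
    my @kept = grep { $filter->($_) } @{ $params->{-seqs} };
    return \@kept;
}

# Keep strings of at least 3 characters
sub my_filter { return length($_[0]) >= 3; }

my %runParam = ( -prog => 'blastp', -seqs => [ 'MKV', 'MA', 'MKVLAT' ] );
my $kept = run_with_filter(\%runParam, \&my_filter);

print ref(\%runParam), " ", ref(\&my_filter), " ", ref($kept), "\n";
print "kept: @$kept\n";
```

The first print shows what ref() returns for each kind of reference (HASH, CODE and ARRAY).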

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;           # read a line
print OUTFILE $sequence;    # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
$sequence = <$in>;                             # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> g');
print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

Typeglobs live in a general namespace that sits on top of the other namespaces (scalars, arrays, hashes and functions). You can use them to create aliases (see the Advanced Perl Programming book).
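
The following sketch illustrates passing a filehandle as a typeglob reference, using only core Perl. The function name write_lines is hypothetical, and the in-memory file (a filehandle opened on a scalar, which needs perl 5.8 or later) is only used here so that the example does not create a file on disk.

```perl
use strict;
use warnings;

# Write lines to any filehandle received as a typeglob reference
sub write_lines {
    my ($fh, @lines) = @_;
    print $fh "$_\n" foreach @lines;
    return scalar @lines;
}

# \*STDOUT is a reference to the STDOUT typeglob
my $count = write_lines(\*STDOUT, 'first line', 'second line');

# The same function works with a filehandle opened by the script
my $buffer = '';
open(OUTFILE, '>', \$buffer) || die "cannot open in-memory file: $!";
write_lines(\*OUTFILE, 'ACGT');
close(OUTFILE);
```

After the close, $buffer holds what was written through OUTFILE.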

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
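
The same mechanism can be tried without bioperl. In this minimal sketch, the classes AbstractSeq and SimpleSeq are hypothetical, and a plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

package AbstractSeq;
sub new { my ($class) = @_; return bless {}, $class; }
# "abstract" method: dies unless a subclass overrides it
sub seq { die "AbstractSeq::seq - implementing class did not provide this method\n"; }

package SimpleSeq;
our @ISA = ('AbstractSeq');
sub seq { return 'ACGT'; }

package main;

my @messages;
foreach my $class ('SimpleSeq', 'AbstractSeq') {
    my $obj = $class->new();
    my $seq = eval { $obj->seq() };    # the ';' closes the eval statement
    if ($@) {
        push @messages, "Caught exception: $@";
    } else {
        push @messages, "no exception: $seq\n";
    }
}
print @messages;
```

The call on a SimpleSeq object succeeds, whereas the call on an AbstractSeq object raises the exception, which eval catches and leaves in $@.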

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) { $format = $opt_f; }
else        { $format = "Genbank"; }
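
To experiment with the first example without typing a command line, the sketch below fills @ARGV by hand. The option values and the extra argument are made up; in a real script @ARGV is filled by the shell.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line: -i embl -O out.fasta extra_argument
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra_argument');

my %opts;
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i format] [-o format] [-I file] [-O file]\n";

# Options that were not given fall back to a default value
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "in: $infile ($inform), out: $outfile ($outform)\n";
print "remaining arguments: @ARGV\n";
```

getopts removes the options it recognizes from @ARGV, so the remaining arguments can then be processed as usual.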

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component.


• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
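The isa and ref calls above can be tried on any pair of classes. In this sketch My::Feature and My::Exon are hypothetical classes defined inline, not bioperl ones.

```perl
use strict;
use warnings;

package My::Feature;             # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package My::Exon;                # hypothetical subclass
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new;

# isa works on a class name as well as on an object
my $class_is_feature  = My::Exon->isa('My::Feature') ? 'yes' : 'no';
my $object_is_feature = $exon->isa('My::Feature')    ? 'yes' : 'no';
my $feature_is_exon   = My::Feature->isa('My::Exon') ? 'yes' : 'no';

# ref returns the class the object was blessed into
my $class_name = ref($exon);

print "class is-a feature: $class_is_feature\n";
print "object is-a feature: $object_is_feature\n";
print "feature is-a exon: $feature_is_exon\n";
print "class of exon: $class_name\n";
```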

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
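The effect of BEGIN on execution order can be observed with a few lines of core Perl. This is only a sketch: the policy is a plain string here, not a Bio::Location object.

```perl
use strict;
use warnings;

our @order;                      # package variables, visible in both phases
our $coord_policy;

push @order, 'main code';        # run time, although written first

BEGIN {
    # compile time: runs as soon as the parser has read this block
    $coord_policy = 'widest';    # hypothetical default value
    push @order, 'BEGIN block';
}

print join(' -> ', @order), "\n";
print "policy: $coord_policy\n";
```

The BEGIN block is recorded first even though it appears after the push statement in the file.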

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
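The three Exporter variables can be exercised with a small module defined inline. My::SeqUtils and its functions are hypothetical; because the package lives in the same file, import is called by hand where a real program would write use My::SeqUtils qw(:seq).

```perl
use strict;
use warnings;

package My::SeqUtils;            # hypothetical module, defined inline
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = ('default_format');       # exported by default
our @EXPORT_OK   = ('revcom');               # imported on demand
our %EXPORT_TAGS = ( seq => ['revcom'] );    # named set of symbols

sub default_format { return 'fasta' }
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;                   # scalar context: reversed string
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

package main;

# Equivalent of: use My::SeqUtils qw(:seq);
My::SeqUtils->import(':seq');

my $rc          = revcom('AACG');
my $has_default = main->can('default_format') ? 'yes' : 'no';

print "revcom: $rc\n";
print "default_format imported: $has_default\n";
```

Asking for an explicit tag replaces the default export list, so default_format is not imported in this run.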

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.
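The following sketch shows two of these restrictions being triggered and caught. The variable names are illustrative only, and the exact wording of the error messages depends on the Perl version.

```perl
use strict;
use vars qw(%valid_type);        # pre-declared package global

%valid_type = map { $_ => 1 } qw(dna rna protein);
my $type = 'dna';                # lexical variable declared with my

# strict refs: a string cannot be used as a variable name
my $name  = 'type';
my $value = eval { ${$name} };
my $refs_error = $@;

# strict vars: an undeclared variable is rejected at compile time
my $ok = eval q{ $undeclared = 1; 1 };
my $vars_error = $@;

print "strict refs: $refs_error";
print "strict vars: $vars_error";
print "valid types: ", join(' ', sort keys %valid_type), "\n";
```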

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:


    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
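The tied-filehandle mechanism can be reproduced in core Perl. In this sketch My::QueueHandle is a hypothetical class that serves strings from an in-memory list instead of sequence objects from a file.

```perl
use strict;
use warnings;
use Symbol;

package My::QueueHandle;         # hypothetical class, not part of bioperl

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;   # scalar context: one item
    return splice @{ $self->{items} };                    # list context: all the rest
}
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'My::QueueHandle', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;       # calls READLINE in scalar context
my @rest  = <$fh>;       # calls READLINE in list context
print $fh $first;        # calls PRINT

print "first: $first, rest: @rest, written: @{ $obj->{written} }\n";
```

As in Bio::SeqIO, the context of the read decides whether one item or all remaining items are returned.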


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat   -- infile format (default 'swiss')\n";
        print "  -o outformat  -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h            -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
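The column arithmetic of this solution can be checked on plain strings, without Bio::AlignIO. The rows below are made up for illustration, and substr stands in for the slice method; this is a sketch of the computation, not of the bioperl API.

```perl
use strict;
use warnings;

# Made-up aligned rows, all of the same length, with - as gap character.
my @rows = ('--ACGT-A--',
            '-CACGTTA-C',
            'GCACG-TA--');

my ($startgaps, $endgaps) = (0, 0);
foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(-*)/;     # run of gaps at the beginning
    my ($trail) = $row =~ /(-*)$/;     # run of gaps at the end
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep columns startgaps+1 .. length-endgaps (1-based), as slice() does.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "$_\n" foreach @trimmed;
```

Internal gaps are kept; only the leading and trailing columns in which at least one row starts or ends with a gap are removed.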

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
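The "first hit per database" logic does not depend on Bio::SearchIO and can be tested on a plain list of names. The hit names below are invented, and this dbname variant returns an empty string when no database prefix is found, which is an assumption not made by the solution above.

```perl
use strict;
use warnings;

# Extract the database prefix, i.e. the word before the first |.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Made-up hit names, listed best hit first as in a Blast report.
my @hits = ('sp|ID0001|FIRST', 'sp|ID0002|SECOND', 'pdb|ID0003|A', 'gb|ID0004|X');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db};      # a better hit from this database was already kept
    $done{$db} = 1;
    push @best, $hit;
}

print "$_\n" foreach @best;
```

Because a Blast report lists hits from best to worst, keeping the first hit seen for each prefix is enough to get the best hit per database.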

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= '-';
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= '-';
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
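The padding done by fill_gaps can be verified on its own. The sketch below is an equivalent formulation that uses the x repetition operator instead of the two loops; the input values are made up.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # sequence runs past the end: no padding
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
my $first  = fill_gaps('ACGT', 1, 4);

print "$padded\n";
print "$first\n";
```

With these values the first call yields a string of exactly 10 columns and the second call leaves the sequence unchanged.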

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene, with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
        else { $seqstr = $seq->subseq($start, $end); }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

Page 35: Bio Perl

Chapter 4 Analysis

Exercise 4.4. Running a Remote Blast

Find out from the bioperl Web site documentation how to run a Blast at NCBI.
Solution A.10

4.1.2 Parsing Blast

4.1.2.1 Parsing from a Blast run result

Pseudocode:

• Run Blast

• Create a report

• For all hits (class Bio::Search::Hit::GenericHit [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/Hit/GenericHit.html]):

  • print its name and database

  • for all HSPs (class Bio::Search::HSP::GenericHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Search/HSP/GenericHSP.html]):

    • print some hit statistics

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation about the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length.
Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report (hint: use the ref() perl function)?
Solution A.12

4.1.2.2 Parsing from a file

The $blast_report that is returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file:

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15. Locate EST in a genome

Locate an EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers are not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                      -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);                      # number of iterations
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while ( my $hit = $result->nextSbjct() ) {
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while ( my $gene = $genscan->next_prediction() ) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put this CDS's exons as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for reaching a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the thing a reference points to (or its class, see below).
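The three kinds of reference listed above (hash, function, output array) can be seen working together in a few lines of plain perl. This is a minimal sketch with made-up data; select_keys is a hypothetical helper, not a bioperl function.

```perl
use strict;
use warnings;

# Apply a filter (code reference) to the values of a hash (hash reference)
# and push the kept keys onto an array passed by reference (output parameter).
sub select_keys {
    my ($params, $filter, $kept) = @_;
    foreach my $key (sort keys %$params) {
        push @$kept, $key if $filter->($params->{$key});
    }
    return scalar @$kept;
}

my %scores = (hitA => 120, hitB => 35, hitC => 80);   # made-up scores
my @selected;
my $count = select_keys(\%scores, sub { $_[0] >= 50 }, \@selected);

print "kept $count: @selected\n";
print ref(\%scores), " ", ref(\@selected), " ", ref(sub { 1 }), "\n";
```

The last line prints the reference types (HASH, ARRAY, CODE), which is what ref() reports for unblessed references.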

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;              # read one line
    print OUTFILE $sequence;       # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f');     # make a tied filehandle
    $sequence = <$in>;                          # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f');
    print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
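To see why the typeglob reference is needed, the sketch below passes filehandles to an ordinary subroutine. It is plain perl with no bioperl involved; write_record is a hypothetical helper, and the in-memory file is used only so that the result can be inspected.

```perl
use strict;
use warnings;

# Write one FASTA-like record to whatever filehandle reference is passed in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A typeglob reference lets a bareword handle travel as an ordinary scalar.
write_record(\*STDOUT, 'seq1', 'ACGT');

# Lexical filehandles are already scalars; this one writes into a string.
my $buffer = '';
open(my $mem, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);
print $buffer;
```

Both calls go through the same code path, which is also how a -fh argument can accept either \*STDOUT or a lexical handle.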

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
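The throw/eval pattern and the abstract-method trick can be reproduced in a few lines without bioperl. In the sketch below, My::SeqI and My::Seq are hypothetical classes invented for the illustration, and the throw method is a much simplified stand-in for the one in Bio::Root::RootI (no stack trace).

```perl
use strict;
use warnings;

package My::SeqI;                      # hypothetical interface class
sub new { my ($class) = @_; return bless {}, $class; }
sub throw {
    my ($self, $msg) = @_;
    die "-------- EXCEPTION --------\nMSG: $msg\n";
}
# "Abstract" method: subclasses are expected to override it.
sub seq {
    my ($self) = @_;
    $self->throw(ref($self) . " did not implement seq");
}

package My::Seq;                       # concrete class that overrides seq
our @ISA = ('My::SeqI');
sub seq { return 'ACGT'; }

package main;

# Call seq inside eval and report what happened.
sub try_seq {
    my ($obj) = @_;
    my $seq = eval { $obj->seq };      # the ';' ends the statement holding the eval block
    return $@ ? "caught: " . (split /\n/, $@)[1] : "ok: $seq";
}

print try_seq(My::Seq->new), "\n";
print try_seq(My::SeqI->new), "\n";
```

The first call succeeds because My::Seq provides seq; the second falls through to the interface class and the exception is caught by eval.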

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
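To experiment with option strings without retyping a command line, getopts can be run on a local copy of @ARGV. The sketch below is a minimal illustration; parse_args and the option letters are assumptions chosen to mirror the converter examples of this course.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse an argument list and return the effective settings.
# A ':' after a letter in the getopts() string means "this option takes a value".
sub parse_args {
    local @ARGV = @_;                  # getopts() works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts)
        or die "usage: prog [-h] [-i informat] [-o outformat] files\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        files   => [@ARGV],            # what is left once options are removed
    };
}

my $conf = parse_args('-i', 'embl', 'seqs.dat');
print "in=$conf->{inform} out=$conf->{outform} files=@{ $conf->{files} }\n";
```

Passing a hash reference as the second argument keeps the options out of the global $opt_x variables, which is the form that works under use strict.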

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
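The same three operations (inheritance through @ISA, the is-a test and ref) can be checked on a toy hierarchy. The classes My::Root and My::Seq below are hypothetical and exist only for this sketch.

```perl
use strict;
use warnings;

package My::Root;                      # hypothetical base class
sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}

package My::Seq;
our @ISA = ('My::Root');               # My::Seq inherits new() from My::Root
sub seq_length { my ($self) = @_; return length($self->{seq}); }

package main;

my $seq = My::Seq->new(seq => 'ACGTAC');

print "class of seq: ", ref($seq), "\n";
print "object is-a My::Root\n" if $seq->isa('My::Root');
print "class is-a My::Root\n"  if My::Seq->isa('My::Root');
print "length: ", $seq->seq_length, "\n";
```

Note that isa works both on an object and on a class name, exactly as in the two tests shown above.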

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
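The effect of the three Exporter variables can be observed with a small module kept in the same file. My::Tools and its two functions are hypothetical; because the package is not in its own file, the sketch calls import directly instead of writing use My::Tools qw(:seq).

```perl
use strict;
use warnings;

package My::Tools;                     # hypothetical module, kept in the same file
use Exporter ();
our @ISA         = ('Exporter');
our @EXPORT      = ();                                # nothing exported by default
our @EXPORT_OK   = qw(revcomp gc_count);              # available on demand
our %EXPORT_TAGS = (seq => [qw(revcomp gc_count)]);   # a named set of symbols

sub revcomp {
    my ($dna) = @_;
    (my $rc = reverse $dna) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}
sub gc_count { my ($dna) = @_; return ($dna =~ tr/GCgc//); }

package main;

# Equivalent to "use My::Tools qw(:seq);" if the module lived in its own file.
My::Tools->import(':seq');

my $rc = revcomp('AACG');
my $gc = gc_count('AACG');
print "$rc $gc\n";
```

After the import, revcomp and gc_count are callable from main without the My::Tools:: prefix.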

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
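The mechanism can be reproduced outside bioperl with a toy class. In the sketch below, My::ListIO is a hypothetical stand-in for a SeqIO-like object: it serves items from a list on reading and stores whatever is printed to it. It is a simplified illustration, not the Bio::SeqIO implementation.

```perl
use strict;
use warnings;
use Symbol ();

package My::ListIO;                    # hypothetical stand-in for a SeqIO-like class

sub new {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

# Return a filehandle tied to this object.
sub fh {
    my ($self) = @_;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

sub TIEHANDLE { my ($class, $obj) = @_; return $obj; }    # reuse the existing object
sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{in} } unless wantarray;        # one item per <$fh>
    my @all = @{ $self->{in} };                            # list context: all that is left
    @{ $self->{in} } = ();
    return @all;
}
sub PRINT { my ($self, @args) = @_; push @{ $self->{out} }, @args; return 1; }

package main;

my $io = My::ListIO->new('seq1', 'seq2', 'seq3');
my $fh = $io->fh;

my $first = <$fh>;                     # calls READLINE in scalar context
print $fh $first;                      # calls PRINT
my @rest = <$fh>;                      # calls READLINE in list context

print "first=$first rest=@rest stored=@{ $io->{out} }\n";
```

As in READLINE above, wantarray distinguishes a single read from a read of everything that remains.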

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh( -file   => $infile,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh( -file   => $infile,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print " -i informat  -- infile format (default 'swiss')\n";
        print " -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print " -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = Bio::AlignIO->new( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = Bio::AlignIO->new( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
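The gap-counting step can be tested on plain strings before plugging it into Bio::SimpleAlign. The sketch below is an illustration on a made-up alignment; gap_margins is a hypothetical helper, and it assumes all rows have the same length and that no row consists only of gaps.

```perl
use strict;
use warnings;

# Given aligned rows of equal length, return the widest run of leading gaps
# and the widest run of trailing gaps found in any row.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($start, $end) = (0, 0);
    foreach my $row (@rows) {
        my ($lead)  = $row =~ /^((?:\Q$gap_char\E)*)/;
        my ($trail) = $row =~ /((?:\Q$gap_char\E)*)$/;
        $start = length($lead)  if length($lead)  > $start;
        $end   = length($trail) if length($trail) > $end;
    }
    return ($start, $end);
}

# Made-up three-row alignment.
my @rows = ('--ACGTAC--', 'TTACGTACG-', '-TACG-ACGG');
my ($start, $end) = gap_margins('-', @rows);

# Keep only the columns where no row has a leading or trailing gap.
my $width = length($rows[0]) - $start - $end;
print substr($_, $start, $width), "\n" foreach @rows;
```

Internal gaps are left untouched; only the ragged ends of the alignment are cut, which is what the slice call does in the solution.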

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = Bio::DB::SwissProt->new;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # minimum hit length to display (default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(),
                  " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                  -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                                  -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(),
                  " significance: ", $hit->significance(), "\n";
            last;    # only the best hit of each result
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;    # keep only the best hit of each database
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();    # hits seen at earlier iterations

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        # (the pattern is garbled in the source; this form, which keeps
        # the text between the "|" separators, is a reconstruction)
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);

        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {
            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from a pipe opened on the golden program
        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 36: Bio Perl

Chapter 4 Analysis

Example 4.2. StandAloneBlast parsing

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();
    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'   => 'blastp',
                                                         'database'  => 'swissprot',
                                                         _READMETHOD => "Blast" );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

$result belongs to the Bio::Search::Result::GenericResult class, whose documentation lists the methods available on results.

Exercise 4.5. Display Blast hits

Display hit accession and HSP length. Solution A.11

Exercise 4.6. Class of a Blast report

What is the class of $blast_report? (Hint: use the ref() perl function.) Solution A.12

4.1.2.2 Parsing from a file

The $blast_report returned by the blastall method of Bio::Tools::Run::StandAloneBlast actually belongs to the Bio::SearchIO class. So you can directly create such an object from a file.

Example 4.3. Parsing from a Blast file

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7. Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8. Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input. Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1. Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]). Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
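The hash-marking hint above can be tried without a Blast report. The following is a minimal sketch in plain Perl, assuming hit names of the form "db|accession|entry"; the hit names and the helper best_by_db are made up for illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Made-up hit names in the "db|accession|entry" style (illustration only).
my @hit_names = ('sp|P04637|P53_HUMAN', 'sp|P02340|P53_MOUSE',
                 'gb|AAA59989|unnamed', 'sp|P10361|P53_RAT',
                 'gb|AAB12345|unnamed');

# Extract the database name: the word before the first "|".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep the first hit seen for each database; hits are assumed to be
# sorted from best to worst, as in a Blast report.
sub best_by_db {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;
        next if $done{$db};        # this database was already reported
        $done{$db} = 1;            # mark the database as done
        push @best, $name;
    }
    return @best;
}

print "$_\n" for best_by_db(@hit_names);
```

The hash costs one lookup per hit, so the report is read only once whatever the number of databases.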

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from each HSP with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• You may first display this as a Clustalw alignment (one sequence by HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                       -format => 'fasta' );

    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
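Comparing the hits of two successive iterations amounts to two set differences. Here is a minimal sketch in plain Perl, independent of bioperl; the hit names and the helper only_in_first are hypothetical and stand for the lists you would collect from each round of the report.

```perl
use strict;
use warnings;

# Hypothetical hit names found at two successive iterations.
my @previous = qw(hitA hitB hitC);
my @current  = qw(hitB hitC hitD);

# Return the elements of @$first that are absent from @$second.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$second;
    return grep { !$seen{$_} } @$first;
}

my @new         = only_in_first(\@current,  \@previous);
my @disappeared = only_in_first(\@previous, \@current);

print "NEW hit name: $_\n"         for @new;
print "DISAPPEARED hit name: $_\n" for @disappeared;
```

Building a lookup hash first avoids rescanning the whole list with grep for every hit.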

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, at which class hierarchy level?

Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you to use the modules [use] or to reach a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) ordinary hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
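The three uses of references listed above (hash of parameters, function, output array) can be combined in one small stand-alone script. This is only a sketch in plain Perl: run_filter, my_filter and the parameter names -hits, -filt_func and -save_array are hypothetical and merely imitate the bioperl calling style.

```perl
use strict;
use warnings;

# A filter function that will be passed by reference (made-up criterion).
sub my_filter {
    my ($hit) = @_;
    return $hit->{score} >= 50;
}

# The callee receives named parameters, including a code reference and
# a reference to an array used as an output parameter.
sub run_filter {
    my (%param) = @_;
    my $func = $param{-filt_func};
    foreach my $hit (@{ $param{-hits} }) {
        push @{ $param{-save_array} }, $hit->{name} if $func->($hit);
    }
    return scalar @{ $param{-save_array} };
}

my @kept;    # filled by run_filter through the reference
my $n = run_filter(
    -hits       => [ { name => 'h1', score => 80 },
                     { name => 'h2', score => 10 } ],
    -filt_func  => \&my_filter,
    -save_array => \@kept,
);

my $anon = {};
print "kept $n hit(s): @kept\n";
print "ref types: ", ref(\@kept), " ", ref(\&my_filter), " ", ref($anon), "\n";
```

Passing \@kept rather than @kept lets the function modify the caller's array instead of a copy.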

6.2.2 Filehandles and streams

• A data stream is for instance an open file. Streams are available through filehandles (file descriptors) - in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
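To see a filehandle travel as a parameter without bioperl, here is a minimal sketch in plain Perl. The sub write_record is hypothetical; it accepts either a typeglob reference such as \*STDOUT or a lexical filehandle, here one opened on an in-memory string.

```perl
use strict;
use warnings;

# Write one FASTA-like record to whatever filehandle is given.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A reference to the typeglob of a bareword filehandle can be passed around.
write_record(\*STDOUT, 'seq1', 'ACGT');

# The same sub works with a lexical filehandle; here an in-memory file.
my $buffer = '';
open(my $mem, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);

print "buffer holds: $buffer";
```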

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
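The eval/die mechanism and the "abstract method" idiom can be tried without bioperl. This is a minimal sketch in plain Perl; the classes Shape, Square and Blob and the sub try_area are hypothetical, and throw here is a simplified stand-in for the bioperl method (no stack trace).

```perl
use strict;
use warnings;

package Shape;           # hypothetical interface-like base class
sub new   { my ($class) = @_; return bless {}, $class; }
sub throw { my ($self, $msg) = @_; die "EXCEPTION MSG: $msg\n"; }
# "Abstract" method: subclasses are expected to override it.
sub area {
    my ($self) = @_;
    $self->throw(ref($self) . " did not provide area()");
}

package Square;          # implements area()
our @ISA = ('Shape');
sub new  { my ($class, $side) = @_; return bless { side => $side }, $class; }
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;            # forgets to implement area()
our @ISA = ('Shape');

package main;

sub try_area {
    my ($obj) = @_;
    my $area = eval { $obj->area() };    # the block may throw
    return $@ ? "caught: $@" : "area = $area\n";
}

print try_area(Square->new(3));
print try_area(Blob->new());
```

The error only appears when the missing method is called, not when the class is compiled, which is why the base class method has to raise it.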

54

Chapter 6 Perl Reminders

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
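To see what getopts leaves behind, the first example can be run against a simulated command line. This is a sketch only: the @ARGV content is made up for illustration, and a real script would take it from the shell.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (assumption for the demo).
local @ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'leftover.txt');

my %opts;
# A letter followed by ':' takes a value; getopts removes parsed options from @ARGV.
getopts('i:o:I:O:', \%opts) or die "bad options\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "in=$infile ($inform) out=$outfile ($outform) rest=@ARGV\n";
```

Options that were not given fall back to the defaults on the right of ||, and the non-option argument stays in @ARGV.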

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { }

    if ($seq->isa("Bio::Seq::RichSeq")) { }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
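The @ISA, isa and ref mechanics listed above can be checked with two toy classes. My::Seq and My::RichSeq below are hypothetical names chosen for this sketch; they only mimic the Bio::Seq / Bio::Seq::RichSeq relationship.

```perl
use strict;
use warnings;

# Hypothetical base class.
package My::Seq;
sub new { my $class = shift; return bless {}, $class; }

# Hypothetical subclass: inheritance is declared through @ISA.
package My::RichSeq;
our @ISA = ('My::Seq');

package main;

my $seq = My::RichSeq->new;                       # new() is inherited from My::Seq

my $class_isa  = My::RichSeq->isa('My::Seq') ? 1 : 0;   # works on a class name
my $object_isa = $seq->isa('My::Seq')        ? 1 : 0;   # and on an object
my $reverse    = My::Seq->isa('My::RichSeq') ? 1 : 0;   # the base is not a subclass
my $class_name = ref($seq);                             # class the object was blessed into

print "class of seq: $class_name\n";
```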

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
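A small sketch of the timing involved, using a hypothetical My::Location class (the 'widest' string stands in for a policy object): code in a BEGIN block runs while the file is being compiled, before any ordinary statement.

```perl
use strict;
use warnings;

# Hypothetical class with a package-wide default set at load time.
package My::Location;
our $coord_policy;
BEGIN { $coord_policy = 'widest'; }      # executed during compilation
sub new {
    my ($class, %args) = @_;
    return bless { policy => $args{policy} || $coord_policy }, $class;
}

package main;

our @trace;
push @trace, 'runtime statement';        # executed when the program runs
BEGIN { push @trace, 'BEGIN block'; }    # executed earlier, although written later

my $loc = My::Location->new;
print "policy: $loc->{policy}\n";
print "order:  ", join(' -> ', @trace), "\n";
```

Although the BEGIN block appears after the push statement in the file, its entry ends up first in @trace.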

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
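The three Exporter variables can be exercised in a single file. This is a sketch under assumptions: My::Tools, revcomp and gc_count are hypothetical names, and the explicit import call in a BEGIN block stands in for the use My::Tools qw(:DEFAULT :stats); line that a separate script would contain.

```perl
use strict;
use warnings;

# Hypothetical module, kept inline so the example is one file.
package My::Tools;
BEGIN {
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(revcomp);               # exported by default
    our @EXPORT_OK   = qw(gc_count);              # imported on demand
    our %EXPORT_TAGS = ( stats => [qw(gc_count)] );   # named set of symbols
}

# Reverse-complement a DNA string.
sub revcomp {
    my ($s) = @_;
    (my $r = reverse $s) =~ tr/ACGTacgt/TGCAtgca/;
    return $r;
}

# Count G and C characters.
sub gc_count {
    my ($s) = @_;
    return ($s =~ tr/GCgc//);
}

package main;

# Equivalent of: use My::Tools qw(:DEFAULT :stats);
BEGIN { My::Tools->import(qw(:DEFAULT :stats)); }

my $rc = revcomp('AACG');
my $gc = gc_count('AACG');
print "revcomp: $rc, GC count: $gc\n";
```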

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
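The tied-filehandle mechanism can be tried independently of Bio::SeqIO. In the sketch below, My::ListHandle is a hypothetical class that serves items from an in-memory list and records whatever is printed to it; only TIEHANDLE, READLINE and PRINT are implemented.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical handle class backed by an in-memory list.
package My::ListHandle;
sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { queue => [@items], written => [] }, $class;
}
sub READLINE {
    my $self = shift;
    # scalar context: next item; list context: all remaining items
    return shift @{ $self->{queue} } unless wantarray;
    return splice @{ $self->{queue} };
}
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh   = Symbol::gensym();                                 # anonymous glob
my $tied = tie *$fh, 'My::ListHandle', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;        # calls READLINE in scalar context
my @rest  = <$fh>;        # calls READLINE in list context
print $fh 'seq4';         # calls PRINT

print "first=$first rest=@rest written=@{ $tied->{written} }\n";
```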

Appendix A. Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default: 'swiss')\n";
        print "  -o outformat -- outfile format (default: 'fasta')\n";
        print "\n";
        print "  -h -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
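The column arithmetic of this solution can be checked on plain strings, without Bio::SimpleAlign. The sketch below assumes a hypothetical helper, trim_gap_columns, that takes the gap character and equal-length aligned rows; it does not handle a row made only of gaps.

```perl
use strict;
use warnings;

# Hypothetical helper: drop leading/trailing columns in which any row has a gap run.
sub trim_gap_columns {
    my ($gap_char, @rows) = @_;
    my ($start, $end) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;   # leading gap run
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;   # trailing gap run
        $start = length($lead)  if length($lead)  > $start;
        $end   = length($trail) if length($trail) > $end;
    }
    # keep only the columns between the longest leading and trailing runs
    return map { substr($_, $start, length($_) - $start - $end) } @rows;
}

my @trimmed = trim_gap_columns('-', '--ACGT-', 'TTACGTA', '-TACG--');
print "$_\n" for @trimmed;
```

Here the longest leading run is 2 columns and the longest trailing run is 2 columns, so each 7-column row is cut down to its 3 central columns.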

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
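The "first hit per database" logic relies only on a hash of databases already seen, so it can be tested on a list of names. The hit names below are invented for this sketch, and they are assumed to be sorted best-first, as hits are in a Blast report.

```perl
use strict;
use warnings;

# Extract the database prefix from a hit name such as 'sp|P12345|NAME'.
sub dbname {
    my ($name) = @_;
    my ($db) = $name =~ /^(\w+)\|/;
    return $db;
}

# Invented hit names, assumed to be ordered from best to worst.
my @hit_names = (
    'sp|P00001|FIRST_ECOLI',
    'sp|P00002|SECOND_HUMAN',
    'pir|A00001|THIRD',
    'gb|AAA00001|FOURTH',
    'pir|B00002|FIFTH',
);

my (%done, @best);
for my $name (@hit_names) {
    my $db = dbname($name);
    next unless defined $db;     # name without a database prefix
    next if $done{$db}++;        # a better hit was already kept for this database
    push @best, $name;
}

print "$_\n" for @best;
```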

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
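The padding done by fill_gaps can also be written with the x repetition operator. The variant below is a sketch, not the course's code: it is meant to produce the same string as the loops above (start - 1 gaps on the left, then enough gaps on the right to reach the alignment length), and it clamps the right-hand padding at zero.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# in a row of $length columns.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;                # sequence runs past the end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

With a 4-letter sequence starting at column 3 of a 10-column alignment, this gives 2 gaps, the sequence, then 4 gaps.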

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);

        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;


    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Exercise 4.7 Parse a Blast output file

Run Example 4.3 with the file you have just created in Exercise 4.3.

Exercise 4.8 Parse Blast results on standard input

Run Example 4.3 with the blast report given on standard input.
Solution A.13

4.1.2.3 Blast Classes diagram

Figure 4.1 Blast Classes diagram

4.1.2.4 Filtering Blast output

Exercise 4.9 Filtering hits by length

Only display hits whose length is greater than a given value (create a blast report from this file [data/blast.out]).
Solution A.14

Exercise 4.10 Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignment from hits whose name contains a given pattern, and print this alignment in Clustalw format.
Solution A.19

Exercise 4.15 Locate EST in a genome

Locate EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence by HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }


Exercise 4.17 Print information from a BPLite report

Take Example 4.4 and print the hit names and, for every HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
    my $query    = <$query_in>;

    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration $iteration\n";
        my $result = $blast_report->round($iteration);
        while (my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\t\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\t\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'psiblast', -file => $ARGV[0]);

    while (my $result = $in->next_result()) {
        foreach my $hit ($result->hits) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration
• the hits that have disappeared

Solution A.25
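
The comparison asked for in this exercise is a set difference between two lists of hit names. The following is only a minimal sketch in plain Perl, independent of bioperl: the helper name compare_rounds and the hit names are made up for illustration, and in a real script the lists would be filled from the names returned for each round of the report.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): compare the hit names of two
# successive iterations and return (new hits, disappeared hits) as array refs.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %before = map { $_ => 1 } @$previous;
    my %now    = map { $_ => 1 } @$current;
    my @new  = grep { !$before{$_} } @$current;    # in current, not in previous
    my @gone = grep { !$now{$_} }    @$previous;   # in previous, not in current
    return (\@new, \@gone);
}

# Made-up hit names, one list per iteration.
my @rounds = ([qw(A B C)], [qw(B C D)], [qw(B D E)]);

for my $i (1 .. $#rounds) {
    my ($new, $gone) = compare_rounds($rounds[$i - 1], $rounds[$i]);
    print "iteration ", $i + 1, ": new = @$new; disappeared = @$gone\n";
}
```

Using hashes as sets keeps each comparison linear in the number of hits, which matters little here but avoids nested loops over the two lists.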

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
    my $query1    = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
    my $query2    = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    $report  = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences
    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }
        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }
        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }
        print "---------------------------------\n";
    }

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?
2. Look at the _parse_predictions method in Bio::Tools::Genscan.
3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?


Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                         -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:
    • containing an entry for each Genscan predicted CDS
    • put the exons of this CDS as Features in the current entry
• (b) index this Embl formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27


Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28
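
To make the target of this exercise concrete, here is a small plain-Perl sketch of how a join(complement(...)) location string is assembled from exon coordinates. It is only an illustration of the text format: the helper embl_location is hypothetical, and in the exercise itself the location would rather be built with bioperl location objects (see add_sub_Location), which take care of the formatting.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl function): format exon coordinates as an
# EMBL location string. Each exon is [start, end, strand], strand 1 or -1.
sub embl_location {
    my @exons = @_;
    my @parts = map {
        my ($start, $end, $strand) = @$_;
        my $range = "$start..$end";
        $strand < 0 ? "complement($range)" : $range;
    } @exons;
    # A single exon needs no join()
    return @parts > 1 ? 'join(' . join(',', @parts) . ')' : $parts[0];
}

# First three exons of the example entry, all on the reverse strand.
my @exons    = ([100, 171, -1], [284, 358, -1], [457, 492, -1]);
my $location = embl_location(@exons);
print "FT   CDS             $location\n";
```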

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that they are not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

    print $out $seq;

Solution A.29


Chapter 6 Perl Reminders

These reminders may help you to use the modules [use], or to reach a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                       -parse     => 1,
                                       -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to the following, with a variable:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain, named hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
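
The three uses of references listed above (input structure, function, output parameter) can be combined in one call. The following is a minimal sketch in plain Perl, written in the named-parameter style of bioperl; the function select_items, its parameter names and the sample data are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical function with bioperl-style named parameters:
#   -data    array reference (input)
#   -filter  code reference (a function passed as a parameter)
#   -results array reference used as an output parameter
sub select_items {
    my %param = @_;
    my @kept  = grep { $param{-filter}->($_) } @{ $param{-data} };
    push @{ $param{-results} }, @kept;    # fills the caller's array
    return scalar @kept;
}

sub is_long { return length($_[0]) > 3 }

my @names = qw(TAUD_ECOLI P53 INS_HUMAN);
my @selected;
my $count = select_items(-data    => \@names,
                         -filter  => \&is_long,
                         -results => \@selected);

print "kept $count: @selected\n";
print ref(\@names), " ", ref(\&is_long), " ", ref({}), "\n";
```

Because -results receives a reference, the function modifies the caller's own @selected array instead of a copy.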

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. A filehandle is associated with a stream when the file is opened:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');      # make a tied filehandle
    $sequence = <$in>;                          # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '> f');
    print $out $sequence;                       # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

  Typeglobs live in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
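
As an illustration of passing filehandles by typeglob reference, here is a minimal sketch in plain Perl. The helper copy_upper is hypothetical; in-memory files (opening a reference to a scalar, available from perl 5.8 on) are used only to keep the sketch self-contained, and real files would be passed in exactly the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from one filehandle to another,
# upper-casing them. Both handles arrive as typeglob references.
sub copy_upper {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} uc($line);
        $count++;
    }
    return $count;
}

# In-memory "files": assumption made for the example only.
my $input  = "acgt\nttga\n";
my $output = '';
open(INFILE,  '<', \$input)  || die "cannot open input: $!";
open(OUTFILE, '>', \$output) || die "cannot open output: $!";

my $lines = copy_upper(\*INFILE, \*OUTFILE);    # pass typeglob references

close(INFILE);
close(OUTFILE);
print "copied $lines lines:\n$output";
```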

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
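
Both ideas, catching an exception with eval and using an exception to encode an abstract method, can be tried without bioperl. The following is a minimal sketch: the classes My::SeqI, My::Seq and My::BrokenSeq and the function try_seq are hypothetical, and a plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and dies unless a
# subclass overrides it (plain die stands in for bioperl's throw).
package My::SeqI;
sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
sub seq {
    my ($self) = @_;
    die ref($self) . " did not provide a seq method\n";
}

# A subclass that implements seq().
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq} }

# A subclass that forgets to implement it.
package My::BrokenSeq;
our @ISA = ('My::SeqI');

package main;

sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };    # the block may die
    return $@ ? "caught: $@" : "ok: $value\n";
}

print try_seq(My::Seq->new(seq => 'ACGT'));
print try_seq(My::BrokenSeq->new());
```

The error only appears when the missing method is actually called, which is the main limitation of this way of encoding abstract methods.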


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
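
To experiment with getopts without typing command lines, the arguments can be supplied from inside the script. The following is a minimal sketch under that assumption: the wrapper parse_options, the option letters and the file name are hypothetical, and local @ARGV is used because getopts always works on @ARGV.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse a list of arguments and return the settings,
# falling back on defaults when an option is absent.
sub parse_options {
    local @ARGV = @_;             # getopts reads and consumes @ARGV
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    my %conf = (
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => exists $opts{h} ? 1 : 0,
        files   => [@ARGV],       # what is left after the options
    );
    return \%conf;
}

my $conf = parse_options('-o', 'gcg', 'seqs.sp');
print "in=$conf->{inform} out=$conf->{outform} files=@{ $conf->{files} }\n";
```

A letter followed by a colon in the specification ('i:') takes a value; a letter alone ('h') is a simple flag.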

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
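
The three operations above (inheriting through @ISA, testing is-a, finding the class of an object) can be tried on a toy hierarchy. The following is a minimal sketch in plain Perl; the classes My::Feature and My::Exon and their methods are hypothetical and much simpler than the bioperl feature classes.

```perl
use strict;
use warnings;

# Hypothetical two-level hierarchy, declared with @ISA as in bioperl 1.0.
package My::Feature;
sub new {
    my ($class, %args) = @_;
    my $self = { start => $args{-start}, end => $args{-end} };
    return bless $self, $class;
}
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1 }

package My::Exon;
our @ISA = ('My::Feature');            # My::Exon is-a My::Feature
sub is_coding { return 1 }

package main;

my $exon = My::Exon->new(-start => 100, -end => 171);   # new() is inherited

print "class of exon: ", ref($exon), "\n";
print "the object is a feature\n" if $exon->isa('My::Feature');
print "the class is a feature\n"  if My::Exon->isa('My::Feature');
print "span: ", $exon->span, "\n";
```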

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
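
The effect of a BEGIN block on execution order can be observed with a short script. The following is a minimal sketch in plain Perl; the variable names and the 'widest' value are made up for the illustration and do not come from Bio::LocationI.

```perl
use strict;
use warnings;

our @trace;                    # package variables, visible inside BEGIN
our $default_policy;

push @trace, 'main code';      # ordinary statement: executed at run time

BEGIN {
    # Executed as soon as this block has been compiled, that is, before
    # any ordinary statement of the file, even those written above it.
    $default_policy = 'widest';
    push @trace, 'BEGIN block';
}

print "order: @trace\n";
print "policy: $default_policy\n";
```

Although the BEGIN block comes second in the file, its entry is recorded first in @trace.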

6.3 Perl reminders for a more advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default
    • @EXPORT_OK = qw(...): symbols imported on demand
    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
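
The three Exporter variables can be tried in a single file. The following is a minimal sketch: the module My::SeqUtils and its functions are hypothetical, and because the module is not stored in its own file, import is called explicitly at run time, which is what the corresponding use statements would do at compile time.

```perl
use strict;
use warnings;

# Hypothetical module, kept in the same file so that the sketch runs as is.
package My::SeqUtils;
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = qw(revcomp);                # exported by default
our @EXPORT_OK   = qw(gc_count);               # imported on demand
our %EXPORT_TAGS = (stats => [qw(gc_count)]);  # a named set of symbols

sub revcomp {
    my ($seq) = @_;
    (my $rc = reverse $seq) =~ tr/acgtACGT/tgcaTGCA/;
    return $rc;
}

sub gc_count {
    my ($seq) = @_;
    return ($seq =~ tr/gcGC//);                # tr returns the number of matches
}

package main;

My::SeqUtils->import();           # what "use My::SeqUtils;" would do
My::SeqUtils->import(':stats');   # what "use My::SeqUtils qw(:stats);" would do

my $rc = revcomp('aacg');
my $gc = gc_count('aacg');
print "revcomp: $rc, gc: $gc\n";
```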

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to variables that are neither declared nor fully qualified (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.
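
The effect of these two pragmas can be observed safely by wrapping the offending statements in eval. The following is a minimal sketch in plain Perl; the variable names are chosen for the illustration only.

```perl
use strict;
use warnings;
use vars qw(%valid_type);      # pre-declared package (global) variable

%valid_type = map { $_ => 1 } qw(dna rna protein);

my $alphabet = 'dna';          # lexical variable, declared with my
print "$alphabet is valid\n" if $valid_type{$alphabet};

# Under strict, an undeclared variable is a compile-time error.
# A string eval lets the sketch show the error without stopping.
my $ok = eval q{ $undeclared = 1; 1 };
my $vars_error = $@;
print "strict refused it: $vars_error" unless $ok;

# Symbolic references are refused at run time.
my $name = 'valid_type';
$ok = eval { my @keys = keys %$name; 1 };
my $refs_error = $@;
print "no symbolic refs: $refs_error" unless $ok;
```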

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self  = shift;
        my $class = ref($self) || $self;
        my $s     = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
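
The mechanism can be reproduced with a toy class that reads from and writes to plain lists. The following is a minimal sketch, not the Bio::SeqIO implementation: the class My::Stream and its fields are hypothetical, and TIEHANDLE simply reuses the existing object.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical stream class: reads items from a list and "writes" them
# to another list, through a tied filehandle.
package My::Stream;

sub new {
    my ($class, @items) = @_;
    my $self = { in => [@items], out => [] };
    return bless $self, $class;
}

sub fh {                                   # same idea as the fh method above
    my $self = shift;
    my $s    = Symbol::gensym();
    tie *$s, ref($self), $self;            # calls TIEHANDLE
    return $s;
}

sub TIEHANDLE { my ($class, $obj) = @_; return $obj }    # reuse the object
sub READLINE  { my $self = shift; return shift @{ $self->{in} } }
sub PRINT     { my $self = shift; push @{ $self->{out} }, @_; return 1 }

package main;

my $stream = My::Stream->new('seq1', 'seq2');
my $fh     = $stream->fh;

my $first = <$fh>;                         # calls READLINE
print {$fh} uc($first);                    # calls PRINT
print "read $first, wrote @{ $stream->{out} }\n";
```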


Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;
    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;
        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);
        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split("\n", $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function\n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query  = $Seq_in->next_seq();

    # b) else
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (!ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank name, i.e. the word before the first "|" of the hit name
    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $seq    = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
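
The padding step of this solution, which places an HSP string at its position inside a row as long as the contig, can be tested on its own. The following is a minimal sketch in plain Perl using the repetition operator x instead of loops; the name pad_with_gaps and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant of the gap-filling step: put $seq at column $start
# (1-based) of a row of $length columns, filling the rest with '-'.
sub pad_with_gaps {
    my ($seq, $start, $length) = @_;
    my $left      = '-' x ($start - 1);
    my $right_len = $length - ($start - 1) - length($seq);
    $right_len = 0 if $right_len < 0;      # sequence runs past the end
    return $left . $seq . ('-' x $right_len);
}

my $row = pad_with_gaps('ACG', 3, 8);
print "$row\n";
print length($row), " columns\n";
```

All rows padded this way have the same length, which is what an alignment object needs before the sequences are added to it.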

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in    = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs   = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');
        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;    # best hit only
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that appear for the first time at this iteration
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits seen earlier that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
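Comparing the hits of two successive iterations is a set difference. The sketch below illustrates that idea on plain lists of made-up hit names, without BioPerl; only_in_first is a hypothetical helper, and a hash lookup replaces the grep with a regular expression used in the solution.

```perl
use strict;
use warnings;

# Hypothetical helper: return the elements of @$first absent from @$second.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$second;
    return grep { !$seen{$_} } @$first;
}

# Made-up hit names for two successive iterations.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_B HIT_C HIT_D);

my @new         = only_in_first(\@round2, \@round1);
my @disappeared = only_in_first(\@round1, \@round2);

print "new: @new\n";
print "disappeared: @disappeared\n";
```

Exact string comparison through a hash also avoids false matches when one hit name is a substring of another.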

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq => $hsp->qs,
            -id => $query1->display_id,
            -start => 1,
            -end => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq => $hsp->ss,
            -id => $report->sbjctName,
            -start => 1,
            -end => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;                    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        }
        else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first=0;
                $start=$loc->start-$delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates so that they are relative to the entry
            $loc->start($loc->start-$start);
            $loc->end ($loc->end-$start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end+$delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry ( join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        }
        else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format of its entries
        %AVAIL_LOCALDB = ( sprot => 'swiss',
                           sptrnrdb => 'swiss',
                           gb => 'genbank',
                           embl => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # an entry is given as database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;


Chapter 4 Analysis

Figure 4.1. Blast Classes diagram

4.1.2.4. Filtering Blast output

Exercise 4.9. Filtering hits by length

Only display hits of length greater than a given value (create a blast report from this file [data/blast.out]). Solution A.14

Exercise 4.10. Filtering hits by position

Only display hits between two positions. Solution A.15

Exercise 4.11. Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

    $done{$database} = 1;

Solution A.16
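The hash hint can be tried without a Blast report. The sketch below is only an illustration: dbname_sketch is a hypothetical stand-in for the dbname function, it assumes hit names of the form "database|accession|entry", and the hit names are made up and listed in report order (best hit first).

```perl
use strict;
use warnings;

# Assumption: hit names look like "database|accession|entry".
sub dbname_sketch {
    my ($hit_name) = @_;
    my ($db) = split /\|/, $hit_name;
    return $db;
}

# Made-up hit names, in report order (best hit first).
my @hit_names = ('sp|P00001|AAA_HUMAN', 'pir|A00002|XYZ',
                 'sp|P00003|BBB_MOUSE', 'pir|A00004|QRS');

my (%done, @best);
foreach my $name (@hit_names) {
    my $database = dbname_sketch($name);
    next if $done{$database};      # this database already has its best hit
    $done{$database} = 1;
    push @best, $name;
}
print "$_\n" foreach @best;
```

Because the hits are visited in report order, the first hit seen for a database is its best one, and the hash only has to remember which databases were already met.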

Exercise 4.12. Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]). Solution A.17

4.1.2.5. Extracting data from a Blast report

Exercise 4.13. Extracting the subject sequence

Extract, as a string, the subject sequence from HSPs with identity above a given level. Solution A.18

Exercise 4.14. Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format. Solution A.19

Exercise 4.15. Locate EST in a genome

Locate EST in a genome:

• display the occurrences of this EST [data/est] in this contig [data/contig] of a genome

• you may first display this as a Clustalw alignment (one sequence per HSP)

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence. Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP). Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file. Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input. Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                       -format => 'fasta' );

    my $query = <$query_in>;
    $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);                      # number of iterations
    $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            foreach my $hsp ($hit->nextHSP) {
                print "\tstart: ", $hsp->hit->start(),
                      " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh=>\*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq => $hsp->qs,
            -id => $query1->display_id,
            -start => 1,
            -end => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq => $hsp->ss,
            -id => $report->sbjctName,
            -start => 1,
            -end => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format. Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT , -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run => \%runParam,
                                        -parse => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method => 'remote',
                  -prog => 'blastp',
                  -database => 'swissprot',
                  -seqs => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method => 'remote',
                  -prog => 'blastp',
                  -database => 'swissprot',
                  -seqs => \@seqs);      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

  See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
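The three uses of references listed above (hash parameter, function parameter, output parameter) can be tried with core Perl only. This is a minimal sketch: apply_filter and collect_lengths are hypothetical names, and the sequences are plain strings rather than Bio::Seq objects.

```perl
use strict;
use warnings;

# A function receiving a hash reference and a code reference.
sub apply_filter {
    my ($params, $filter) = @_;
    return grep { $filter->($_) } @{ $params->{-seqs} };
}

my %run_param = ( -prog => 'blastp', -seqs => [ 'MKV', 'MA', 'MKVLAA' ] );
my @kept = apply_filter(\%run_param, sub { length($_[0]) >= 3 });

# An output parameter: the function fills the array it was given.
sub collect_lengths {
    my ($seqs, $out) = @_;
    push @$out, length($_) foreach @$seqs;
}
my @lengths;
collect_lengths($run_param{-seqs}, \@lengths);

print "kept: @kept\n";
print "lengths: @lengths\n";
print ref(\%run_param), " ", ref($run_param{-seqs}), "\n";
```

The last line shows what ref() returns for references to plain data: HASH and ARRAY.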

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
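Passing a bareword filehandle as a typeglob reference can be checked without any file on disk. In this sketch (core Perl only), write_line is a hypothetical helper and the "file" is an in-memory buffer, which is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Write one line to whatever filehandle is passed in.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle must be passed as a typeglob reference.
open(OUTFILE, '>', \my $buffer) || die "cannot open in-memory file: $!";
write_line(\*OUTFILE, 'first line');
close(OUTFILE);

# A lexical filehandle is an ordinary scalar and needs no special syntax.
open(my $in, '<', \$buffer) || die "cannot reopen buffer: $!";
my $line = <$in>;
close($in);
chomp $line;

write_line(\*STDOUT, "read back: $line");
```

This is the same mechanism as the -fh => \*STDOUT parameter given to newFh.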

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
            "MSG: ".$string."\n".$std.
            "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    }
    else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
        else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
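The two mechanisms described above (an abstract method that raises an exception, and eval to catch it) can be combined in a short self-contained sketch. SketchSeqI and SketchSeq are hypothetical classes, not bioperl ones, and a plain die stands in for the throw method.

```perl
use strict;
use warnings;

package SketchSeqI;                 # hypothetical interface class
sub new { my ($class) = @_; return bless {}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide seq\n";
}

package SketchSeq;                  # hypothetical implementation
our @ISA = ('SketchSeqI');
sub seq { return 'ACGT'; }

package main;

# Run a method call inside eval and report what happened.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };
    return $@ ? "caught: $@" : "ok: $value\n";
}

print try_seq(SketchSeq->new);
print try_seq(SketchSeqI->new);
```

The first call succeeds because SketchSeq overrides seq; the second one reaches the interface definition, which dies, and the message ends up in $@.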

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else { $format = "Genbank"; }
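To experiment with getopts without typing a command line, the options can be parsed from a list. The sketch below uses only the core Getopt::Std module; parse_options is a hypothetical wrapper, and the option letters follow the first example above (each one is assumed to take a value).

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse the options from a list instead of the real command line.
sub parse_options {
    local @ARGV = @_;
    my %opts;
    getopts('i:o:I:O:', \%opts) or die "bad options\n";
    return ( infile  => $opts{I} || 'infile',
             outfile => $opts{O} || 'outfile',
             inform  => $opts{i} || 'swiss',
             outform => $opts{o} || 'fasta' );
}

my %conf = parse_options('-I', 'seqs.sp', '-o', 'gcg');
print join(' ', map { "$_=$conf{$_}" } sort keys %conf), "\n";
```

Options that are not given on the command line fall back to the defaults after the || operator.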

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component.

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
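The early execution of a BEGIN block is easy to observe with core Perl. In this small sketch, the BEGIN block is written after the ordinary statement, yet it runs first because it is executed as soon as it has been compiled.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';                 # executed when the program runs
BEGIN { push @order, 'BEGIN block'; }    # executed as soon as it is compiled

print join(' -> ', @order), "\n";
```

This is why a BEGIN block is a convenient place to initialise data that methods will rely on, such as the default coordinate policy above.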

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
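The three Exporter variables can be exercised with a module defined in the same file. SketchTools and its two functions are hypothetical; since the module is not loaded with use, its import method is called explicitly at run time, which is an adaptation made for this single-file sketch.

```perl
use strict;
use warnings;

package SketchTools;                 # hypothetical module, defined inline
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = ('revcom');              # exported by default
our @EXPORT_OK   = ('gc_count');            # imported on demand
our %EXPORT_TAGS = ( all => [ 'revcom', 'gc_count' ] );

# Reverse complement of a DNA string.
sub revcom   { my $s = reverse $_[0]; $s =~ tr/ACGT/TGCA/; return $s; }
# Number of G and C letters in a DNA string.
sub gc_count { my $s = $_[0]; return ($s =~ tr/GC//); }

package main;

# Equivalent of "use SketchTools qw(:all);" for a module living in this file.
SketchTools->import(':all');

print revcom('AACG'), " ", gc_count('AACG'), "\n";
```

With a module stored in its own file, the same effect is obtained at compile time by the use statement with the :all tag.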

6.3.2. Compiler instructions

• use strict;: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
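The tie mechanism for filehandles can be reproduced with a few lines of core Perl. SketchFh is a hypothetical class, much simpler than the Bio::SeqIO one: it reads from a list of strings and stores whatever is printed, so that the calls to READLINE and PRINT can be observed.

```perl
use strict;
use warnings;

package SketchFh;                    # hypothetical class, not Bio::SeqIO::Fh
sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}
sub READLINE { my $self = shift; return shift @{ $self->{in} }; }
sub PRINT    { my $self = shift; push @{ $self->{out} }, @_; return 1; }

package main;
use Symbol;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'SketchFh', 'seq1', 'seq2';

my $first = <$fh>;            # calls READLINE
print $fh $first;             # calls PRINT
print "read: $first, stored: @{ $obj->{out} }\n";
```

The caller only sees an ordinary filehandle; the class behind it decides what reading and printing mean.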

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

60

Appendix A Solutions

localbinperl -w

exercise 11 UnivConvert via file

use strict

use BioSeqIO

my $inform = shift ARGVmy $outform = shift ARGV

my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )

$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform

my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

perl UnivConvert2pl swiss gcg seqssp
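The default output name used by this converter can be tested on its own. The sketch below assumes the rule is "drop everything from the first dot, then append the output format"; the helper name default_outfile is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive an output file name from the input name and the output format.
# Assumption: everything from the first dot onwards is treated as extension.
sub default_outfile {
    my ($infile, $outform) = @_;
    (my $base = $infile) =~ s/\..*$//;   # no-op when the name has no dot
    return "$base.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";
print default_outfile('seqs', 'fasta'), "\n";
```

A name with several dots, such as seqs.v2.sp, loses everything after the first one under this rule.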

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

# apply the default before lc() so that -w does not warn on a missing option
my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
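The option handling of this solution can be exercised without bioperl. The following is a sketch only: parse_formats is a hypothetical helper, and the format table is a shortened subset picked for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Formats accepted by this sketch (a subset chosen for illustration).
my %formats = map { $_ => 1 } qw(fasta swiss embl genbank gcg raw);

# Parse -i/-o from an argument list and return the two format names,
# or an empty list if an option or a format is not recognised.
sub parse_formats {
    my @args = @_;
    local @ARGV = @args;              # getopts() reads @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or return;
    my $inform  = lc($opts{i} || 'swiss');
    my $outform = lc($opts{o} || 'fasta');
    for my $form ($inform, $outform) {
        return unless exists $formats{$form};
    }
    return ($inform, $outform);
}

my ($in, $out) = parse_formats('-o', 'GCG');
print "in=$in out=$out\n";
```

Localising @ARGV keeps the parsing testable: the function can be called several times with different argument lists.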

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function -- exo
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";
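The comment scan can be checked on plain strings. This sketch assumes that comment texts carry Swiss-Prot style topics introduced by "-!- FUNCTION:"; the function name protein_function and the sample comments are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text of the first '-!- FUNCTION:' topic found in a list of
# comment strings, or '' if there is none.
sub protein_function {
    my @comments = @_;
    for my $text (@comments) {
        for my $line (split /\n/, $text) {
            return $line if $line =~ s/^-!- FUNCTION:\s*//;
        }
    }
    return '';
}

# Invented comment texts, not taken from a real entry.
my @comments = (
    "-!- SUBUNIT: Homodimer.",
    "-!- FUNCTION: Catalyzes an invented example reaction.\n-!- PATHWAY: None.",
);
print protein_function(@comments), "\n";
```

Returning from inside the inner loop replaces the two last statements of the solution.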

Solution A.5 Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    # use the alignment gap character here too, rather than a literal '-'
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
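The column arithmetic of this solution is easy to verify on plain strings. The sketch below is independent of Bio::SimpleAlign; ungapped_range is a hypothetical helper and the three rows are invented data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based (start, end) columns that remain once the longest
# leading gap run and the longest trailing gap run have been cut off.
sub ungapped_range {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length $lead  if length $lead  > $startgaps;
        $endgaps   = length $trail if length $trail > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC', 'AACGTA--', '-ACGTAC-');   # invented alignment rows
my ($from, $to) = ungapped_range('-', @rows);

for my $row (@rows) {
    print substr($row, $from - 1, $to - $from + 1), "\n";
}
```

The returned pair corresponds to the two arguments passed to slice() in the solution.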

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
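The window test of this solution, with -1 meaning "no upper bound", can be isolated in a small function. This is a sketch: in_window is a hypothetical name, and the coordinates are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if [start, end] lies inside the requested window.
# A missing or zero $min_start defaults to 1; a missing $max_end
# defaults to -1, which disables the upper bound.
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    $min_start = 1  unless $min_start;
    $max_end   = -1 unless $max_end;
    return ($start >= $min_start && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

# Invented HSP coordinates: [start, end]
for my $hsp ([5, 80], [120, 200], [150, 400]) {
    my ($s, $e) = @$hsp;
    print "$s-$e: ", in_window($s, $e, 100, 250), "\n";
}
```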

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;   # databases for which the best hit has already been printed
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20 Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start, and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
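The padding step can be checked in isolation. The variant below is a sketch that computes the same padding as the two loops of fill_gaps with the repetition operator; the clamp on the right-hand side is an addition that the solution does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at 1-based column $start of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;          # fragment runs past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

Every padded row then has the same width as the contig row, which is what allows the rows to be placed in one alignment.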

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;   # keep the best hit only
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22 Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25 Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray;   # hits seen in earlier iterations
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
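The "new" and "disappeared" bookkeeping can be tried on plain lists. This sketch swaps the grep over hit names for hash lookups and compares each iteration with the one just before it; the hit names are invented and stand in for a parsed report.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hit names per iteration; invented data standing in for a parsed report.
my @rounds = (
    [qw(P1 P2 P3)],
    [qw(P1 P3 P4)],
    [qw(P3 P4 P5)],
);

my %previous;                      # hits seen in the previous iteration
my @log;
for my $i (0 .. $#rounds) {
    my %current = map { $_ => 1 } @{ $rounds[$i] };
    my @new  = grep { !$previous{$_} } sort keys %current;
    my @gone = grep { !$current{$_} }  sort keys %previous;
    push @log, sprintf "iteration %d: new=[%s] disappeared=[%s]",
        $i + 1, join(',', @new), join(',', @gone);
    %previous = %current;
}
print "$_\n" for @log;
```

A hash lookup tests for the exact name, whereas a pattern match also accepts names that merely contain the hit name.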


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;
    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split('\|', $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else                 { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;   # a module must end with a true value
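The module above decides whether golden found the entry by looking at the exit status of the piped command. That pattern can be tried on its own, as in the following sketch: run_piped is a hypothetical helper, and the running perl ($^X) is used as a stand-in for the external program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a command through a read pipe and return (output, exit code).
sub run_piped {
    my (@cmd) = @_;
    open(my $pipe, '-|', @cmd) or die "cannot start @cmd: $!";
    my $output = do { local $/; <$pipe> };
    $output = '' unless defined $output;
    close $pipe;                      # close() waits for the child and sets $?
    return ($output, $? >> 8);        # the exit code is in the high byte of $?
}

# $^X is the running perl, used here in place of an external program.
my ($out_ok,  $rc_ok)  = run_piped($^X, '-e', 'print "found"; exit 0');
my ($out_bad, $rc_bad) = run_piped($^X, '-e', 'exit 3');

print "ok:  rc=$rc_ok out=$out_ok\n";
print "bad: rc=$rc_bad\n";
```

The status in $? is only meaningful once the pipe has been closed, which is presumably why the module destroys the Bio::SeqIO object before testing it.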


Page 39: Bio Perl

Chapter 4 Analysis

Exercise 4.10 Filtering hits by position

Only display hits between two positions.
Solution A.15

Exercise 4.11 Display the best hit by databank

Display the best hit for each database in a Blast NCBI report (example [data/ncbi-blast.out]).

• If necessary, you can find here a dbname(hit->name) [solution/dbname.pl] function to extract the database name from the hit name.

• You may use a perl hash to mark a database as already done:

$done{$database} = 1;

Solution A.16
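The two hints can be combined and tested on plain strings before touching a Blast report. In this sketch the hit names are invented, and dbname is a local helper written for the example rather than the course's solution/dbname.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the database prefix from a hit name such as 'sp|P11111|A_ECOLI'.
# Returns undef when the name holds no 'word|' prefix.
sub dbname {
    my ($name) = @_;
    return $name =~ /(\w+)\|/ ? $1 : undef;
}

# Invented hit names, already sorted best-first as in a Blast report.
my @hits = ('sp|P11111|A_ECOLI', 'sp|P22222|B_ECOLI',
            'pir|S33333|S33333', 'gb|AAA44444|', 'pir|S55555|S55555');

my (%done, @best);
for my $hit (@hits) {
    my $db = dbname($hit);
    next if !defined $db || $done{$db}++;   # skip all but the first hit per database
    push @best, $hit;
}
print "$_\n" for @best;
```

Because hits arrive sorted by significance, the first hit seen for a database is its best one.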

Exercise 4.12 Multiple queries

Submit two sequences and display the best hit for each (2nd query [data/P42704.fasta]).
Solution A.17

4.1.2.5 Extracting data from a Blast report

Exercise 4.13 Extracting the subject sequence

Extract as a string the subject sequence from HSPs with identity above a given level.
Solution A.18

Exercise 4.14 Extracting alignments

Extract the HSP alignments from hits whose name contains a given pattern, and print these alignments in Clustalw format.
Solution A.19

Exercise 4.15 Locate EST in a genome

Locate EST in a genome:

• Display the occurrences of this EST [data/est] in this contig [data/contig] of a genome.

• You may first display this as a Clustalw alignment (one sequence per HSP).

• Blast report [data/blast-est.out]

Solution A.20

Exercise 4.16 Record Blast hits as sequence features

Iterate a local Blast on databases given on the command line, and record each corresponding best hit as a feature for the query sequence.
Solution A.21

4.1.3 Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a psiblast (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4 Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), the $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class, but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17 Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18 Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19 Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );

my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );

my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.
Solution A.26

4.1.6 Blast: Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );
    while ( my $gene = $genscan->next_prediction() ) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ( $gene->exons() ) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding: ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ( $gene->introns() ) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ( $gene->utrs() ) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful to use modules [use] or to reach a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = ( a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [ 1, 2 ];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
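The three kinds of references listed above (hash, function, output array) can be tried out without bioperl. The following is only an illustrative sketch: run_filter, its parameters and the scores are hypothetical names, not bioperl calls.

```perl
use strict;
use warnings;

# Hypothetical function: takes a hash reference, a code reference,
# and an array reference used as an output parameter.
sub run_filter {
    my ($params, $filter, $results) = @_;
    foreach my $name (sort keys %$params) {
        push @$results, $name if $filter->($params->{$name});
    }
    return scalar @$results;
}

my %scores = (hitA => 120, hitB => 35, hitC => 80);
my @kept;
my $count = run_filter(\%scores, sub { $_[0] >= 50 }, \@kept);

print "$count kept: @kept\n";    # prints "2 kept: hitA hitC"
print ref(\%scores), " ", ref(\@kept), " ", ref(sub { 1 }), "\n";    # prints "HASH ARRAY CODE"
```

Because @kept is passed by reference, the function fills the caller's array directly, which is what a parameter such as -save_array relies on.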

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;           # read a line
    print OUTFILE $sequence;    # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
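Passing a filehandle as a parameter can be tested on its own. This sketch assumes a hypothetical helper, write_record, and contrasts a bareword handle passed as a typeglob reference with a lexical handle, which is already a scalar and needs no typeglob.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a FASTA-like record to whatever
# filehandle it receives.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
}

# A bareword filehandle must be passed as a typeglob reference.
write_record(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle (here opened on an in-memory string) can be
# passed directly, because it is already a scalar holding a reference.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);

print "buffer holds: $buffer";
```

Lexical filehandles did not exist in the oldest perl versions this course targets, so the typeglob reference remains the form to expect in bioperl 1.0 code.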

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq( $querySeq, $subjectSeq );
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }
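The die and eval pair can be exercised without bioperl. The following sketch assumes a hypothetical validator, set_sequence, that raises an exception in the same spirit as the throw method; the message text is only modelled on the dump shown earlier.

```perl
use strict;
use warnings;

# Hypothetical validator that raises an exception with die.
sub set_sequence {
    my ($seq) = @_;
    die "MSG: [$seq] does not look healthy\n" unless $seq =~ /^[A-Za-z]+$/;
    return uc $seq;
}

foreach my $candidate ('acgt', '==') {
    # eval returns the value of the block, or undef if it died;
    # the error message is then available in $@.
    my $result = eval { set_sequence($candidate) };
    if ($@) {
        print "Caught exception: $@";
    } else {
        print "no exception: $result\n";
    }
}
```

Testing $@ right after the eval block matters, since any later eval resets it.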

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
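The abstract method idiom can be reproduced in a few lines. In this sketch the packages My::SeqI, My::Seq and My::Broken are hypothetical stand-ins for an interface class, a conforming subclass and a subclass that forgot to implement the method.

```perl
use strict;
use warnings;

# Hypothetical interface class: its seq method only raises an exception.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq { die ref($_[0]) . " did not provide a seq method\n"; }

# A subclass that fulfils the interface.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { return $_[0]->{seq}; }

# A subclass that forgets to implement seq.
package My::Broken;
our @ISA = ('My::SeqI');

package main;

my $good = My::Seq->new(seq => 'ACGT');
my $bad  = My::Broken->new(seq => 'ACGT');

print "good: ", $good->seq, "\n";
my $value = eval { $bad->seq };
print "bad: $@" if $@;
```

The error only appears when the missing method is called, not when the class is loaded, which is the main limitation of this idiom compared with compile-time interfaces.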

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
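To see what getopts does with a given option string, the command line can be simulated inside the script. This is a sketch only: the argument values and the extra -v flag are made up for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line; in a real script @ARGV is filled by the shell.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', '-v', 'seqs.dat');

my %opts;
# 'i:', 'o:', 'I:' and 'O:' expect a value; 'v' is a plain flag.
getopts('i:o:I:O:v', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "input format:  $inform\n";
print "output format: $outform\n";
print "output file:   $outfile\n";
print "verbose:       ", ($opts{v} ? 'yes' : 'no'), "\n";
print "remaining:     @ARGV\n";
```

getopts removes the options it recognises from @ARGV, so the remaining arguments (here the input file name) can be read afterwards with shift.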

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if ( Bio::Tools::Blast->isa("Bio::Root::RootI") ) { ... }

    if ( $seq->isa("Bio::Seq::RichSeq") ) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
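The isa and ref tests work the same way on any class hierarchy. A small sketch with two hypothetical packages, My::Feature and My::Exon, standing in for bioperl classes:

```perl
use strict;
use warnings;

package My::Feature;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package My::Exon;
our @ISA = ('My::Feature');    # My::Exon inherits from My::Feature
sub is_coding { return 1; }

package main;

my $exon = My::Exon->new(start => 10, end => 42);

print "class of exon: ", ref($exon), "\n";    # prints "class of exon: My::Exon"
print "exon is-a My::Feature\n"     if $exon->isa('My::Feature');
print "My::Exon is-a My::Feature\n" if My::Exon->isa('My::Feature');
print "exon can is_coding\n"        if $exon->can('is_coding');
```

isa accepts either an object or a class name as invocant, while ref only makes sense on an object.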

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
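The three Exporter variables can be tried with a module defined in the same file. This is a sketch under assumptions: My::Util, revcomp and gc_count are hypothetical names, and the BEGIN blocks are only needed because the module is inline rather than loaded with use.

```perl
use strict;
use warnings;

# Hypothetical module defined inline; normally it would live in My/Util.pm.
package My::Util;
use Exporter;
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS);
BEGIN {
    @ISA         = ('Exporter');
    @EXPORT      = ('revcomp');                       # exported by default
    @EXPORT_OK   = ('gc_count');                      # imported on demand
    %EXPORT_TAGS = (all => ['revcomp', 'gc_count']);  # a tagged set of symbols
}

sub revcomp {
    my ($seq) = @_;
    (my $rc = reverse $seq) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

sub gc_count {
    my ($seq) = @_;
    return ($seq =~ tr/GCgc//);    # tr returns the number of matched characters
}

package main;

# Equivalent of "use My::Util qw(:all);" for a module defined in this file.
BEGIN { My::Util->import(':all'); }

print revcomp('AACG'), "\n";     # prints "CGTT"
print gc_count('AACG'), "\n";    # prints "2"
```

With a real My/Util.pm file, the two BEGIN blocks disappear: the assignments run when the file is loaded, and use calls import automatically.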

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
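The tie mechanism can be observed on a reduced class. The sketch below is not the Bio::SeqIO implementation: My::RecordIO is a hypothetical class that keeps its records in memory and follows the same fh, TIEHANDLE, READLINE and PRINT layout.

```perl
use strict;
use warnings;

# Hypothetical class whose tied filehandle reads and writes records
# held in memory, mimicking the READLINE/PRINT pair of Bio::SeqIO.
package My::RecordIO;
use Symbol ();

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}

# Return a filehandle tied to this class, wrapping the object.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie: the tied object simply keeps a reference to the wrapped one.
sub TIEHANDLE { my ($class, $obj) = @_; return bless { io => $obj }, $class; }
sub READLINE  { my $self = shift; return shift @{ $self->{io}{in} }; }
sub PRINT     { my $self = shift; push @{ $self->{io}{out} }, @_; return 1; }

package main;

my $io = My::RecordIO->new('seq1', 'seq2');
my $fh = $io->fh;

my $first = <$fh>;             # calls READLINE
print {$fh} uc($first);        # calls PRINT
print "read $first, stored @{ $io->{out} }\n";    # prints "read seq1, stored SEQ1"
```

For brevity READLINE here ignores list context; the Bio::SeqIO version above shows how wantarray is used to return all remaining records at once.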

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( !defined $outfile ) {
        # default output name: input name with its extension replaced
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= "." . $outform;
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if ( exists $opts{'h'} );

    my $inform  = lc( $opts{'i'} ) || 'swiss';
    my $outform = lc( $opts{'o'} ) || 'fasta';

    ( format_ok($inform) && format_ok($outform) ) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( !exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ( $link->database() eq 'PDB' ) {
            push( @structures, $link->primary_id() );
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split( "\n", $comment->text() );
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ( $feat->primary_tag() eq 'TRANSMEM' ) {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ( $aln->each_seq ) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ( $startgaps > $len ) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ( $endgaps > $len ) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice( $startgaps + 1, $aln->length() - $endgaps );

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new( -file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss' );
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            }
            else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-fh'     => \*STDIN );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ( $ARGV[1] ) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ( $hit->length() >= $max_length ) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ( $ARGV[1] ) ? $ARGV[1] : 1;
    my $max_end   = ( $ARGV[2] ) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->start() >= $min_start &&
                 ( $max_end == -1 || ( $hsp->end() <= $max_end ) ) ) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname( $hit->name() );
        if ( $done{$dbname} ) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                   -format => 'fasta' );
    push( @query_seqs, $Seq1_in->next_seq() );

    my $Seq2_in = Bio::SeqIO->new( -file   => $ARGV[1],
                                   -format => 'fasta' );
    push( @query_seqs, $Seq2_in->next_seq() );

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall( \@query_seqs );
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->frac_identical() >= $level ) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->frac_identical() >= $level ) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[1] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps( $hsp->query_string, $start, $aln->length );
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ( $seq, $start, $length ) = @_;
        my $gapped_seq = "";
        for ( my $i = 0 ; $i < $start - 1 ; $i++ ) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for ( my $i = $start + length($seq) ; $i <= $length ; $i++ ) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new( -file => $seqfile, -format => 'fasta' );
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new( database => $db,
                                                             program  => 'blastp' );

        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ( $query->all_SeqFeatures() ) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh( -fh => \*STDOUT, '-format' => 'swiss' );
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file   => $ARGV[0],
                                  -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'  => 'blastp',
                                                         'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite( -file => $ARGV[0] );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive ), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite( -fh => \*STDIN );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive ), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite( '-file' => $ARGV[0] );

    for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if ( grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push( @previous_hitarray, @$oldhitarray_ref );
        push( @previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );
    $report  = $factory->bl2seq( $query1, $query2 );

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::Seq::RichSeq;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

    # date of construction
    my $date = gmtime;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ( $gene->exons() ) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ( $gene->utrs() ) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database
#   - 1 entry per gene with genomic DNA
#   + 1 feature for the whole CDS
#   + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # exon coordinates become relative to the start of the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation',
                                $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # database name => sequence format of its entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # entries are given as db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ''; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry through a pipe from the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;


Chapter 4 Analysis

Exercise 4.16. Record Blast hits as sequence features

Iterate a local Blast on the databases given on the command line, and record each corresponding best hit as a feature of the query sequence.

Solution A.21

4.1.3. Bio::Tools::BPlite family parsers

The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for running a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).

Example 4.4. Parsing with BPLite

In the following example (notice that we have removed the _READMETHOD parameter), $blast_report does not belong to the Bio::SearchIO [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SearchIO.html] class but to the Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] class. That is why the methods used to get information are different, namely nextSbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html#POD8] and nextHSP [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html#POD4] of class Bio::Tools::BPlite::Sbjct [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite/Sbjct.html], instead of next_hit and next_hsp. There is no next_result method, since these parsers were not able to deal with multiple reports.

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).

Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.

Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.

Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                   -format => 'fasta' );
my $query = <$query_in>;

$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
$blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->nextHSP ) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast: SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent, with a variable, to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
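The reference forms listed above can be tried without bioperl. The following sketch uses only core Perl; run_search, my_filter, %runParam and @results are hypothetical names that merely mimic the parameter style of the Bio::Tools::Blast call (a hash passed by reference, a function passed by reference, and an array used as an output parameter).

```perl
use strict;
use warnings;

# Hypothetical filter: keep only hits whose score is at least 50.
sub my_filter {
    my ($hit) = @_;
    return $hit->{score} >= 50;
}

# Hypothetical function taking named parameters, some of them references.
sub run_search {
    my (%args) = @_;
    my $param  = $args{-run};          # reference to a hash (input)
    my $filter = $args{-filt_func};    # reference to a function
    my $saved  = $args{-save_array};   # reference to an array (output)
    foreach my $hit (@{ $param->{-hits} }) {
        push @$saved, $hit->{name} if $filter->($hit);
    }
    return scalar @$saved;
}

my %runParam = ( -prog => 'blastp',
                 -hits => [ { name => 'A', score => 80 },    # anonymous hashes
                            { name => 'B', score => 10 } ] ); # in an anonymous array
my @results;
my $count = run_search( -run        => \%runParam,
                        -filt_func  => \&my_filter,
                        -save_array => \@results );

print "kept $count hit(s): @results\n";
print ref(\%runParam), " ", ref(\&my_filter), " ", ref(\@results), "\n";
```

Because @results is passed by reference, the function fills the caller's array directly instead of returning a copy.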

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $line;         # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
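To see the typeglob reference at work outside bioperl, here is a minimal sketch in core Perl. The helper write_record and the record contents are hypothetical; the second half uses a lexical filehandle opened on an in-memory string, a standard Perl feature, so that nothing is written to disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA-like record to any filehandle.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword handle such as STDOUT must be passed as a typeglob reference.
write_record(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle is already a scalar and can be passed directly.
# Here it writes into an in-memory string instead of a file.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($out, 'seq2', 'TTGA');
close($out);

print "buffer holds: $buffer";
```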

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
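Both ideas, catching with eval and abstract methods that die, can be combined in a small core Perl sketch. The classes My::SeqI, My::GoodSeq and My::BrokenSeq and the helper safe_seq are hypothetical; they use a plain die where bioperl would call throw.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and must be overridden.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub seq { die "My::SeqI: implementing class did not provide seq()\n"; }

# Subclass that forgets to implement seq().
package My::BrokenSeq;
our @ISA = ('My::SeqI');

# Subclass that implements seq().
package My::GoodSeq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package main;

# Returns the sequence, or the caught error message prefixed by "error: ".
sub safe_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq() };   # note the ';' closing the statement
    return "error: $@" if $@;
    return $value;
}

print safe_seq( My::GoodSeq->new(seq => 'ACGT') ), "\n";
print safe_seq( My::BrokenSeq->new() );
```

The program keeps running after the failed call, which is the point of wrapping it in eval.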

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
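The option string deserves a closer look: a letter followed by a colon expects a value, a bare letter is a flag. The sketch below, in core Perl, wraps getopts in a hypothetical parse_args function so that it can be called on any argument list; the option letters and the file name seqs.embl are only examples.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parses a list of arguments and returns the chosen settings.
# A letter followed by ':' takes a value; a bare letter is a boolean flag.
sub parse_args {
    local @ARGV = @_;          # getopts() works on @ARGV
    my %opts;
    getopts('i:o:h', \%opts)
        or die "usage: prog [-h] [-i informat] [-o outformat] files\n";
    return ( inform  => $opts{i} || 'swiss',
             outform => $opts{o} || 'fasta',
             help    => $opts{h} ? 1 : 0,
             files   => [ @ARGV ] );   # what is left after option parsing
}

my %conf = parse_args('-i', 'embl', 'seqs.embl');
print "in=$conf{inform} out=$conf{outform} files=@{ $conf{files} }\n";
```

Storing the options in a hash, as in the first example above, avoids the global $opt_x variables of the second form and works under use strict.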

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component.

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
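The three mechanisms above (@ISA, isa and ref) can be observed on a toy hierarchy. This is a sketch in core Perl; the classes My::Root and My::Seq and the method size are hypothetical and stand in for the bioperl classes.

```perl
use strict;
use warnings;

package My::Root;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }

package My::Seq;
our @ISA = ('My::Root');          # My::Seq inherits new() from My::Root
sub size { my ($self) = @_; return length $self->{seq}; }

package main;

my $seq = My::Seq->new(seq => 'ACGTAC');

print "class of seq: ", ref($seq), "\n";
print "object is a My::Root\n" if $seq->isa('My::Root');     # test on an object
print "class is a My::Root\n"  if My::Seq->isa('My::Root');  # test on a class
print "size: ", $seq->size(), "\n";
```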

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which uses it to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
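The timing of a BEGIN block can be checked with a few lines of core Perl. In this sketch the array @order is a hypothetical variable used only to record the order of execution.

```perl
use strict;
use warnings;

our @order;                       # records the order of execution

push @order, 'main code';         # runs at run time, after compilation

BEGIN { push @order, 'BEGIN block'; }   # runs as soon as it is compiled

print join(' -> ', @order), "\n";
```

Although the BEGIN block comes later in the file, its entry is recorded first, because the block is executed while the file is still being compiled.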

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
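The three Exporter variables can be tried in a single file. In this sketch the module My::Tools, its functions gc_count and at_count and its variable $Default are all hypothetical; the module is defined inline in a BEGIN block (and registered in %INC) only so that the example needs no separate .pm file.

```perl
use strict;
use warnings;

# A hypothetical module, defined inline so that the example is a single file.
BEGIN {
    package My::Tools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('gc_count');                 # exported by default
    our @EXPORT_OK   = ('at_count', '$Default');     # exported on demand
    our %EXPORT_TAGS = ( obj => [ '$Default' ] );    # a named set of symbols
    our $Default     = 'ACGT';
    sub gc_count { my ($s) = @_; return scalar( () = $s =~ /[GC]/gi ); }
    sub at_count { my ($s) = @_; return scalar( () = $s =~ /[AT]/gi ); }
    $INC{'My/Tools.pm'} = 1;      # tell perl the module is already loaded
}

use My::Tools;                    # imports gc_count (default export)
use My::Tools qw(:obj at_count);  # imports $Default and at_count

print "GC in $Default: ", gc_count($Default), "\n";
print "AT in $Default: ", at_count($Default), "\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is declared in both places.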

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables unless they are declared or fully qualified (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
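The same mechanism can be reproduced without bioperl. The following core Perl sketch defines a hypothetical class My::RecordFh whose tied filehandle reads and writes whole records (here simple strings) held in an in-memory queue; it is not the Bio::SeqIO implementation, only an illustration of TIEHANDLE, READLINE and PRINT.

```perl
use strict;
use warnings;

# Hypothetical class: a tied filehandle working on an in-memory queue.
package My::RecordFh;
use Symbol ();

sub newFh {
    my ($class, @records) = @_;
    my $fh = Symbol::gensym();                 # anonymous glob
    tie *$fh, $class, @records;                # calls TIEHANDLE
    return $fh;
}

sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { queue => [ @records ] }, $class;
}

sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{queue} } unless wantarray;   # scalar context
    return splice @{ $self->{queue} };                   # list context: all
}

sub PRINT {
    my ($self, @items) = @_;
    push @{ $self->{queue} }, @items;
    return 1;
}

package main;

my $fh = My::RecordFh->newFh('seq1', 'seq2');
my $first = <$fh>;              # calls READLINE
print $fh 'seq3';               # calls PRINT
my @rest = <$fh>;               # calls READLINE in list context
print "first=$first rest=@rest\n";
```

As in Bio::SeqIO, the caller only sees ordinary read and print statements; the class decides what a "line" is.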

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile ) {
    # default: input file name with the output format as extension
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0;    # false
    }
    return 1;        # true
}

sub usage {
    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
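The column arithmetic of this solution can be checked on plain strings, without Bio::AlignIO. In the following core Perl sketch the function ungapped_range and the three aligned rows are hypothetical; the function returns the same 1-based column range that is passed to slice above.

```perl
use strict;
use warnings;

# Returns the 1-based (start, end) columns left after removing the leading
# and trailing columns in which at least one sequence has a gap.
sub ungapped_range {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC--',
            'TTACGTACG-',
            '-TACG-ACGG');
my ($start, $end) = ungapped_range('-', @rows);
print "keep columns $start..$end\n";
print substr($_, $start - 1, $end - $start + 1), "\n" foreach @rows;
```

Internal gaps, such as the one in the third row, are kept: only the gapped blocks at both ends of the alignment are cut.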

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
                  'program'   => 'blastp',
                  'database'  => 'swissprot',
                  _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # length threshold (default 200); note that it is used below as a lower bound
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix of a hit name, i.e. the word before the first '|'
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;    # databanks for which the best hit has already been displayed
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
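The "first hit per databank" bookkeeping used in this solution can be tested on plain strings. The sketch below assumes hit names of the form databank|accession|name, already sorted best hit first; all names in it are invented for illustration.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|X00001|HIT_A".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hypothetical hit names, assumed to be sorted best hit first.
my @hits = ('sp|X00001|HIT_A', 'sp|X00002|HIT_B', 'tr|X00003|HIT_C',
            'pdb|1ABC|A', 'tr|X00004|HIT_D');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    # skip names without a prefix and all but the first hit of each databank
    next if !defined $db || $done{$db}++;
    push @best, $hit;
}
print "$_\n" for @best;
```

Since a Blast report lists hits by decreasing significance, the first hit seen for a databank is also its best one, which is why a simple "already seen" hash is enough.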

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
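The padding done by fill_gaps is easy to check in isolation. The following sketch is a compact rewrite using the x repetition operator instead of the two loops; it is meant as an illustration under the same conventions (1-based $start, alignment width $length) and the sample values are made up.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = '-' x ($start - 1);
    my $right = $length - ($start - 1) - length($seq);
    $right = 0 if $right < 0;          # sequence already reaches the last column
    return $left . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";    # prints --ACGT----
```

Every padded row has the same width as the contig, which is what allows it to be added to the alignment next to the contig sequence.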

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag: ", $feat->primary_tag,
            ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen at the previous iterations
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that were not present at the previous iteration
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits of the previous iterations that are no longer present
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
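Comparing the hits of two iterations amounts to two set differences. The sketch below shows this on plain lists of names with a lookup hash, which avoids the repeated grep over the old hits; the helper name only_in_first and the hit names are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical hit names found at two successive iterations.
my @previous = qw(HIT_A HIT_B HIT_C);
my @current  = qw(HIT_B HIT_C HIT_D);

# Return the elements of @$x that are absent from @$y.
sub only_in_first {
    my ($x, $y) = @_;
    my %seen = map { $_ => 1 } @$y;
    return grep { !$seen{$_} } @$x;
}

my @new         = only_in_first(\@current, \@previous);
my @disappeared = only_in_first(\@previous, \@current);

print "NEW hit name: $_\n" for @new;
print "DISAPPEARED hit name: $_\n" for @disappeared;
```

Here @previous holds the hits of the preceding iteration only; to follow this variant in the solution, the list would have to be reset at each iteration instead of accumulating.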


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    # print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            # the entry starts $delta bases before the first exon
            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates into the coordinates of the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databanks known to golden, with the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # close the pipe, then check the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module file must return a true value

  • Bioperl course
    • Chapter 1. General introduction
      • 1.1. Bioperl documentation
      • 1.2. General bioperl classes
    • Chapter 2. Sequences
      • 2.1. The Bio::SeqIO class
      • 2.2. Sequence classes
      • 2.3. Features and Location classes
      • 2.4. Sequence analysis tools
    • Chapter 3. Alignments
      • 3.1. AlignIO
      • 3.2. SimpleAlign
      • 3.3. Code reading: protal2dna
    • Chapter 4. Analysis
      • 4.1. Blast
      • 4.2. Genscan
    • Chapter 5. Databases
      • 5.1. Database classes
      • 5.2. Accessing a local database with golden
    • Chapter 6. Perl Reminders
      • 6.1. UML
      • 6.2. Perl reminders to use bioperl modules
      • 6.3. Perl reminders for a further advanced understanding of bioperl modules
    • Appendix A. Solutions
      • A.1. Sequences
      • A.2. Alignments
      • A.3. Analysis
      • A.4. Databases
Page 41: Bio Perl

Chapter 4 Analysis

Exercise 4.17. Print information from a BPLite report

Take Example 4.4 and print the hit names and, for each HSP, its P value, percent identity and score (see Bio::Tools::BPlite::HSP).
Solution A.22

Exercise 4.18. Create a Bio::Tools::BPlite from a file

Create a Bio::Tools::BPlite from a file.
Solution A.23

Exercise 4.19. Create a Bio::Tools::BPlite from standard input

Create a Bio::Tools::BPlite from standard input.
Solution A.24

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See the documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or the tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while( my $hit = $result->nextSbjct()) {
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations.

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.
Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it in order to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while(my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while(my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                " coding: ", $exon->is_coding(),
                " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1);
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that they are not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules (Section 6.2) or for a more advanced understanding of them (Section 6.3).

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) a plain, named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
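The three uses of references listed in this section (hash and function parameters, output parameters, anonymous structures) can be combined in a short script that does not need bioperl. This is only a sketch: the sub names apply_filter and collect_lengths and the data are invented for the illustration.

```perl
use strict;
use warnings;

# A sub that receives a hash reference and a code reference.
sub apply_filter {
    my ($params, $filter) = @_;
    return grep { $filter->($params->{$_}) } sort keys %$params;
}

my %scores = (hit_a => 120, hit_b => 35, hit_c => 80);   # hypothetical data
my @kept = apply_filter(\%scores, sub { $_[0] >= 50 });
print "kept: @kept\n";

# An output parameter: the sub fills the caller's array through a reference.
sub collect_lengths {
    my ($words, $out) = @_;
    push @$out, map { length($_) } @$words;
}

my @lengths;
collect_lengths([ 'ACGT', 'AC' ], \@lengths);   # anonymous array as input
print "lengths: @lengths\n";

# ref() reports what kind of thing a reference points to.
my $kinds = join(' ', ref(\%scores), ref(\@lengths), ref(\&apply_filter));
print "$kinds\n";
```

The script prints the kept keys (hit_a hit_c), the two lengths (4 2) and the reference kinds HASH ARRAY CODE.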

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read one line
    print OUTFILE $sequence;     # write the data

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
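As an illustration of passing a filehandle to a function, the following sketch writes through a typeglob reference and reads back through a lexical filehandle. It assumes an in-memory file (a reference to a scalar given to open) so that it runs without touching the disk; write_record is a hypothetical helper.

```perl
use strict;
use warnings;

# Write one record to whatever filehandle the caller passes in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword filehandle must be passed as a typeglob reference.
my $buffer = '';
open(OUT, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record(\*OUT, 'seq1', 'ACGT');
close(OUT);

# A lexical filehandle is an ordinary scalar and can be passed directly.
open(my $in, '<', \$buffer) || die "cannot reopen buffer: $!";
my @lines = <$in>;
close($in);

print "read ", scalar(@lines), " lines, first is: $lines[0]";
```

Lexical filehandles (open(my $fh, ...)) avoid the typeglob reference altogether, but bareword handles such as STDIN and STDOUT still have to be passed as \*STDIN and \*STDOUT.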

623 Exceptions

bull When using a bioperl module the following can happen

-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130

STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85

STACK toplevel solutionblast_featurespl22-------------------------------------------

bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught

53

Chapter 6 Perl Reminders

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
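The die/eval mechanism can be tried without bioperl. The sketch below assumes a made-up validator, check_sequence, that stands in for a bioperl method that throws; the message text only imitates the MSG line shown earlier.

```perl
use strict;
use warnings;

# Hypothetical validator standing in for a bioperl method that throws.
sub check_sequence {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n"
        unless $seq =~ /^[A-Za-z*-]+$/;
    return length $seq;
}

# Returns the error message, or an empty string if nothing was thrown.
sub try_sequence {
    my ($seq) = @_;
    my $ok = eval { check_sequence($seq); 1 };   # note the ';' after the block
    return $ok ? '' : $@;
}

print "valid:   '", try_sequence('ACGT'), "'\n";
print "invalid: '", try_sequence('=='), "'\n";
```

Ending the eval block with a true value and testing the result of eval, rather than $@ alone, is a common defensive idiom; testing $@ as in the example above works as well in simple scripts.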

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
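The abstract-method idea can be sketched with three small classes. All class names below (My::SeqI, My::Seq, My::Broken) are invented for the illustration and are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() must be overridden.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . " did not provide a seq method\n";
}

# Concrete class that implements the abstract method.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

# Incomplete class that forgot to implement it.
package My::Broken;
our @ISA = ('My::SeqI');

package main;

my $good = My::Seq->new(seq => 'ACGT')->seq;
my $bad  = eval { My::Broken->new->seq } || "error: $@";
print "$good\n$bad";
```

The error only appears when the missing method is actually called, so this technique checks the interface at run time, not at compile time.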


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    # -f takes a value; the other switches are assumed to be plain flags
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
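To experiment with getopts without typing a command line, the arguments can be placed in a localized @ARGV. This is a sketch only: the wrapper parse_args and its option set are assumptions chosen to mirror the converter examples.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse a list of arguments and apply the defaults.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;            # getopts() always reads @ARGV
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    return {
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
        help    => $opts{h} ? 1 : 0,
        rest    => [@ARGV],         # whatever is left after the switches
    };
}

my $conf = parse_args('-o', 'GCG', 'seqs.sp');
print "$conf->{inform} -> $conf->{outform}, file: $conf->{rest}[0]\n";
```

A letter followed by a colon in the option string takes a value; a letter alone is a boolean switch.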

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
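The isa and ref tests can be checked on a toy hierarchy. The classes My::Feature and My::Exon below are hypothetical stand-ins, not bioperl classes.

```perl
use strict;
use warnings;

package My::Feature;                 # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package My::Exon;                    # hypothetical subclass
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new;
print "class of exon: ", ref($exon), "\n";
print "object is-a My::Feature\n" if $exon->isa('My::Feature');
print "class is-a My::Feature\n"  if My::Exon->isa('My::Feature');
print "not an object\n"           if ref('plain string') eq '';
```

Note that isa works both on an object and on a class name, whereas ref only makes sense on a reference.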

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
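The three Exporter variables can be seen at work in one file. In this sketch the module My::Util, its variable $Default and its function revcom are all invented; because the module is defined inline, import is called by hand where a separate file would be loaded with use.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so the sketch is self-contained.
package My::Util;
use Exporter;
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Default);
BEGIN {
    @ISA         = ('Exporter');
    @EXPORT      = ();                          # nothing exported by default
    @EXPORT_OK   = qw($Default revcom);         # available on demand
    %EXPORT_TAGS = (obj => [qw($Default)]);     # the ':obj' symbol set
    $Default     = 'static object';
}
sub revcom {
    my ($dna) = @_;
    (my $rc = reverse $dna) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

package main;
# For a module stored in its own file, "use My::Util qw(:obj revcom);" does this.
BEGIN { My::Util->import(qw(:obj revcom)); }
our $Default;

print "$Default\n";
print revcom('AACG'), "\n";
```

Symbols named in a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses to export them.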

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
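The tie mechanism can be reproduced without bioperl. The class My::Stream below is a made-up record stream that only imitates the pattern above: its TIEHANDLE simply returns the existing object, which is a simplification compared with Bio::SeqIO.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical record stream: READLINE hands out records, PRINT collects them.
package My::Stream;
sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub fh {                                  # same pattern as the fh method above
    my $self = shift;
    my $s = Symbol::gensym;
    tie *$s, ref($self), $self;
    return $s;
}
sub TIEHANDLE { my ($class, $obj) = @_; return $obj; }
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @all = @{ $self->{in} };
    $self->{in} = [];
    return @all;
}
sub PRINT { my $self = shift; push @{ $self->{out} }, @_; return 1; }

package main;

my $stream = My::Stream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;                       # calls READLINE in scalar context
print $fh $first;                         # calls PRINT
my @rest   = <$fh>;                       # calls READLINE in list context
print "first=$first rest=@rest written=@{ $stream->{out} }\n";
```

The wantarray test is what lets the same READLINE serve both a single read and a read of all remaining records.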


Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 11: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;
    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;

    # no output file given: derive it from the input file name
    if ( ! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 11: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = (shift @ARGV) || 'swiss';
    my $outform = (shift @ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 11: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 11: UnivConvert with option and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;
        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);
        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # (almost the same for GenBank/EMBL entries)
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
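The gap-counting step can be checked on plain strings, without any alignment object. This sketch uses a made-up three-row alignment; the function gap_margins is a hypothetical name, and substr plays the role that slice plays on a Bio::SimpleAlign object.

```perl
use strict;
use warnings;

# Longest run of leading and trailing gap characters over all rows.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($start, $end) = (0, 0);
    foreach my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $start = length($lead)  if length($lead)  > $start;
        $end   = length($trail) if length($trail) > $end;
    }
    return ($start, $end);
}

# Toy alignment (made-up data); every row has the same length.
my @rows = ('--ACGTAC-', 'TTACG-AC-', '-TACGTACG');
my ($start, $end) = gap_margins('-', @rows);
my @trimmed = map { substr($_, $start, length($_) - $start - $end) } @rows;
print "$_\n" for @trimmed;
```

Internal gaps are kept; only the columns before the last-starting sequence and after the first-ending one are removed.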

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)
    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # keep only the hits whose length is at least this value (default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $min_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix from a hit name such as db|accession|name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    my %done;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
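The "first hit per databank" logic can be tested on a plain list of names. In this sketch the hit names are invented and the db|accession|name layout is an assumption about how the report names its hits; best_by_db is a hypothetical helper.

```perl
use strict;
use warnings;

# Databank prefix of a hit name such as "sp|P12345|NAME" (layout assumed
# here); returns undef when the name has no "db|" prefix.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hits are assumed to be sorted best-first, as in a Blast report.
sub best_by_db {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next if !defined $db or $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Made-up hit names for illustration.
my @hits = ('sp|P37610|TAUD_ECOLI', 'sp|Q00001|OTHER', 'pir|A12345|XYZ', 'sp|Q00002|THIRD');
my @best = best_by_db(@hits);
print "$_\n" for @best;
```

Returning undef for names without a prefix avoids reusing a stale $1 from an earlier successful match.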

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;    # only the first hit of each result
        }
    }


Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
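The padding performed by fill_gaps can be verified on its own. The variant below is a sketch that uses the repetition operator instead of loops; it is meant to give the same result as the loops above for HSPs that fit inside the alignment, and it clamps the right-hand padding to zero otherwise.

```perl
use strict;
use warnings;

# Pad an HSP string with gaps so that it starts at column $start (1-based)
# and spans $length columns in total.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

Every row added to the alignment must have the same length as the contig row, which is why both sides are padded.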

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');
        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);
    my @previous_hitarray = ();

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that were not present at the previous iteration
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits of the previous iteration that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        # remember the hits of this iteration only
        @previous_hitarray = (@$oldhitarray_ref, @$hitarray_ref);
    }
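The comparison between two rounds is a set difference, which can be sketched with hashes on plain lists of names. The hit names and the helper compare_rounds are invented for the example.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive rounds.
sub compare_rounds {
    my ($previous, $current) = @_;          # two array references
    my %before = map { $_ => 1 } @$previous;
    my %now    = map { $_ => 1 } @$current;
    my @new  = grep { !$before{$_} } @$current;
    my @gone = grep { !$now{$_} }    @$previous;
    return (\@new, \@gone);
}

# Made-up hit names for two iterations.
my @round1 = qw(TAUD_ECOLI HIT_A HIT_B);
my @round2 = qw(TAUD_ECOLI HIT_B HIT_C);
my ($new, $gone) = compare_rounds(\@round1, \@round2);
print "NEW: @$new\n";
print "DISAPPEARED: @$gone\n";
```

A hash lookup tests exact names, whereas the grep with a regular expression in the solution above also matches names that merely contain the hit name.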


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {
            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }
        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds ) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');
        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;
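The entry-name handling of this module can be tried without golden or bioperl. The sketch below copies the database table from the module and wraps the two checks in a made-up function, resolve_entry, which dies where the module would call throw.

```perl
use strict;
use warnings;

# Same table as in the module above: database name => sequence format.
my %AVAIL_LOCALDB = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split "db:id" and look up the format; dies on malformed or unknown input.
sub resolve_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry, 2);
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = resolve_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";
my $err = eval { resolve_entry('nodb:X'); 1 } ? '' : $@;
print $err;
```

Checking the name before building the golden command line keeps the error message specific: an unknown database is reported as such, not as a missing entry.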

Page 42: Bio Perl

Figure 4.2 BPLite Classes diagram

4.1.4 PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5 Running PSI-blast

The following example runs a PSI-blast:

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query = <$query_in>;

    my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
    $factory->j(2);
    my $blast_report = $factory->blastpgp($query);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "iteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        while ( my $hit = $result->nextSbjct() ) {
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->nextHSP ) {
                print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
                print "\tscore: ", $hsp->score(), "\n";
            }
        }
    }

You can also load a report from a file or from standard input:

    my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6 PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations:

    use Bio::SearchIO;

    my $in = Bio::SearchIO->new( -format => 'psiblast',
                                 -file   => $ARGV[0] );

    while ( my $result = $in->next_result() ) {
        foreach my $hit ( $result->hits ) {
            print "Hit: $hit\n";
        }
    }

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );
    $report  = $factory->bl2seq( $query1, $query2 );

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2 Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );

    while ( my $gene = $genscan->next_prediction() ) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new( -file => $genscan_file );
    while ( my $gene = $genscan->next_prediction() ) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

        foreach my $exon ( $gene->exons() ) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ( $gene->introns() ) {
            my $loc = $intron->location;
            print "intron: primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ( $gene->utrs() ) {
            my $loc = $utr->location;
            print "utr: primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?


Chapter 5. Databases

5.1 Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • with the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, but this is not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you use the modules [use] or reach a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = ( a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
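The reference forms listed above can be tried without bioperl. The following is a minimal sketch using only core Perl; describe_run, collect_lengths and %run_param are hypothetical names chosen for illustration, not part of any bioperl module.

```perl
use strict;
use warnings;

# Hypothetical helper: receives a hash reference and a code reference.
sub describe_run {
    my ($params, $filter) = @_;
    my @kept = grep { $filter->($_) } @{ $params->{-seqs} };
    return sprintf "%s/%s kept=%d (%s, %s)",
        $params->{-method}, $params->{-prog}, scalar @kept,
        ref($params), ref($filter);
}

my @seqs      = ( 'ACGT', 'AC', 'ACGTACGT' );
my %run_param = ( -method => 'remote', -prog => 'blastp', -seqs => \@seqs );

# \%run_param is a reference to a named hash; sub {...} is an anonymous function.
my $summary = describe_run( \%run_param, sub { length( $_[0] ) >= 4 } );
print "$summary\n";

# Output parameter: the callee fills the caller's array through a reference.
sub collect_lengths {
    my ($in, $out) = @_;
    push @$out, map { length($_) } @$in;
    return;
}

my @lengths;
collect_lengths( \@seqs, \@lengths );
print "lengths: @lengths\n";
```

Passing \@lengths rather than @lengths is what lets collect_lengths modify the caller's array instead of a flattened copy.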

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in Perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;              # read one line
    print OUTFILE $sequence;       # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '>f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
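As an illustration of passing a filehandle as a parameter, here is a small sketch in core Perl. The sub write_record is a hypothetical helper, and the in-memory file (opening a reference to a scalar) is only used so that the example needs no file on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA-like record to the handle it is given.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
    return;
}

# A bareword handle such as STDOUT is passed as a reference to its typeglob.
write_record( \*STDOUT, 'seq1', 'ACGT' );

# The same sub accepts a lexical handle; this one writes into a string.
my $buffer = '';
open( my $mem, '>', \$buffer ) or die "cannot open in-memory file: $!";
write_record( $mem, 'seq2', 'TTGA' );
close $mem;

print "captured: $buffer";
```

The block form print {$fh} ... makes it explicit that $fh is the handle and not the first item to print.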

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq( $querySeq, $subjectSeq );
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
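The two ideas above (catching an exception with eval, and using an exception to mark a method as abstract) can be sketched together in core Perl. ToySeqI, ToySeq and try_seq are hypothetical names; the sketch uses plain die rather than the bioperl throw method.

```perl
use strict;
use warnings;

package ToySeqI;    # hypothetical interface class

sub new {
    my ($class, %args) = @_;
    my $self = {%args};
    return bless $self, $class;
}

# "Abstract" method: raises an exception unless a subclass overrides it.
sub seq {
    my ($self) = @_;
    die ref($self) . " did not provide a seq method\n";
}

package ToySeq;     # hypothetical implementing class
our @ISA = ('ToySeqI');

sub seq { my ($self) = @_; return $self->{seq}; }

package main;

# Call seq inside eval and report whether an exception was raised.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };
    return $@ ? "caught: $@" : "ok: $value";
}

my $good = try_seq( ToySeq->new( seq => 'ACGT' ) );
my $bad  = try_seq( ToySeqI->new );
print $good, "\n", $bad;
```

The error message ends with a newline so that die does not append the file name and line number to it.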

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f } else { $format = "Genbank" }
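To experiment with option parsing without typing a command line, the argument list can be supplied from inside the script. The sketch below assumes a hypothetical wrapper named parse_options; getopts itself is the standard Getopt::Std function and reads @ARGV, which is why the wrapper localizes it.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parse a list of arguments and apply defaults.
sub parse_options {
    local @ARGV = @_;    # getopts() works on @ARGV
    my %opts;
    getopts( 'i:o:h', \%opts ) or die "bad options\n";
    my %conf = (
        inform  => $opts{i} || 'swiss',    # -i takes a value
        outform => $opts{o} || 'fasta',    # -o takes a value
        help    => $opts{h} ? 1 : 0,       # -h is a flag
        rest    => [@ARGV],                # arguments left after the options
    );
    return \%conf;
}

my $conf = parse_options( '-i', 'embl', '-h', 'seqs.embl' );
print "$conf->{inform} -> $conf->{outform}, help=$conf->{help}, ",
      "files=@{ $conf->{rest} }\n";
```

A letter followed by a colon in the specification string expects a value; a letter alone is a boolean flag.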

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult])

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component

• Test whether a class or an object is-a:

    if ( Bio::Tools::Blast->isa("Bio::Root::RootI") ) { ... }

    if ( $seq->isa("Bio::Seq::RichSeq") ) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
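These class operations can be tried on a toy hierarchy. ToyRoot and ToyRichSeq below are hypothetical classes written only for this sketch; isa and ref are the standard Perl mechanisms mentioned above.

```perl
use strict;
use warnings;

package ToyRoot;                 # hypothetical base class

sub new {
    my ($class, %args) = @_;
    my $self = {%args};
    return bless $self, $class;
}

package ToyRichSeq;              # hypothetical subclass
our @ISA = ('ToyRoot');          # inheritance is declared through @ISA

sub accession { return $_[0]{accession} }

package main;

my $seq = ToyRichSeq->new( accession => 'X00001' );

my $class_is_a  = ToyRichSeq->isa('ToyRoot') ? 1 : 0;    # test on a class name
my $object_is_a = $seq->isa('ToyRoot')       ? 1 : 0;    # test on an object
my $class_name  = ref($seq);                             # class of an object

print "class of seq: $class_name (is-a ToyRoot: $object_is_a)\n";
```

new is inherited from ToyRoot, but the object is blessed into the class it was called on, so ref returns ToyRichSeq.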

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
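The order of execution can be observed with a few lines of core Perl. In this sketch, @order, $coord_policy and policy are hypothetical names, and the string 'widest' merely stands in for a policy object.

```perl
use strict;
use warnings;

our @order;

# This statement appears first in the file but only runs at run time.
push @order, 'run time';

# A BEGIN block runs as soon as it has been compiled, before run time.
BEGIN { push @order, 'compile time' }

# Package-level default initialised in a BEGIN block.
our $coord_policy;
BEGIN { $coord_policy = 'widest' }

sub policy { return $coord_policy }

print join( ', then ', @order ), "\n";
print "default policy: ", policy(), "\n";
```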

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
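The three Exporter variables can be seen at work in a single file. Toy::Blast below is a hypothetical module defined inline; the BEGIN block and the %INC entry are only there so that the later use statement finds it without a separate .pm file.

```perl
use strict;
use warnings;

# A hypothetical module defined inline; BEGIN makes it available to "use" below.
BEGIN {
    package Toy::Blast;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ();                          # nothing exported by default
    our @EXPORT_OK   = qw($Blast blast_version);    # exported on demand
    our %EXPORT_TAGS = ( obj => [qw($Blast)] );     # a named set of symbols
    our $Blast       = { name => 'static object' };
    sub blast_version { return '0.1' }
    $INC{'Toy/Blast.pm'} = 1;                       # pretend the file was loaded
}

# Import the symbols tagged "obj" plus one symbol requested by name.
use Toy::Blast qw(:obj blast_version);

print "$Blast->{name}, version ", blast_version(), "\n";
```

Symbols named in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Blast is listed twice.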

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
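The same mechanism can be reproduced outside bioperl. The sketch below defines a toy class, UpperLines (a hypothetical name, as is make_tied_fh), whose tied filehandle hands out stored lines on reading and stores an upper-cased copy on printing. It is not the Bio::SeqIO implementation, only an illustration of TIEHANDLE, READLINE and PRINT.

```perl
use strict;
use warnings;
use Symbol ();

package UpperLines;    # hypothetical class implementing a tied filehandle

# TIEHANDLE is called by tie(); it receives the class and the extra arguments.
sub TIEHANDLE {
    my ($class, @lines) = @_;
    my $self = { in => [@lines], out => [] };
    return bless $self, $class;
}

# READLINE is called by <$fh>: one item in scalar context, the rest in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @rest = @{ $self->{in} };
    $self->{in} = [];
    return @rest;
}

# PRINT is called by "print $fh ...".
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, map { uc($_) } @_;
    return 1;
}

package main;

# Build an anonymous glob and tie it, as Bio::SeqIO::fh does.
sub make_tied_fh {
    my @lines = @_;
    my $fh = Symbol::gensym();
    tie *$fh, 'UpperLines', @lines;
    return $fh;
}

my $fh    = make_tied_fh( "acgt\n", "ttga\n" );
my $first = <$fh>;           # calls READLINE
print $fh $first;            # calls PRINT
my $obj = tied *$fh;         # the object behind the handle
print "stored: $obj->{out}[0]";
```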

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if ( exists $opts{'h'} );

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    ( format_ok($inform) && format_ok($outform) ) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default: 'swiss')\n";
        print "  -o outformat -- outfile format (default: 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ( $link->database() eq 'PDB' ) {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ( $feat->primary_tag() eq 'TRANSMEM' ) {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ( $aln->each_seq ) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ( $startgaps > $len ) ? $startgaps : $len;
        # use the alignment's gap character here too, not a hard-coded '-'
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ( $endgaps > $len ) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice( $startgaps + 1, $aln->length() - $endgaps );
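The column arithmetic of this solution can be checked on plain strings, without an alignment object. The sketch below is core Perl only; trim_columns is a hypothetical helper and the three aligned rows are made up for the example.

```perl
use strict;
use warnings;

# Return the 1-based first and last columns to keep, given aligned strings.
sub trim_columns {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;    # leading run of gaps
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;    # trailing run of gaps
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ( $startgaps + 1, length( $rows[0] ) - $endgaps );
}

# Made-up rows of equal length.
my @rows = ( '--ACGTAC-', 'TTACGTACG', '-TACGT---' );
my ($first, $last) = trim_columns( '-', @rows );

print "keep columns $first..$last: ",
      join( ' ', map { substr( $_, $first - 1, $last - $first + 1 ) } @rows ),
      "\n";
```

The longest leading run and the longest trailing run decide the cut, so every remaining row starts and ends on a residue.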

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new( -file => 'golden sp:TAUD_ECOLI |', -format => 'swiss' );
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( ! ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-fh'     => \*STDIN );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # minimum hit length to report (second argument, default 200)
    my $min_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ( $hit->length() >= $min_length ) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->start() >= $min_start &&
                 ( $max_end == -1 || ( $hsp->end() <= $max_end ) ) ) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank prefix of a hit name such as "sp|..."
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;
    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname( $hit->name() );
        if ( $done{$dbname} ) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
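The "first hit per databank" logic does not depend on the Blast parser and can be tested on a list of names. In this sketch, best_by_db is a hypothetical helper and the hit names are invented; it assumes, as the solution does, that hits arrive in report order with the best one first.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|X00001|HIT1".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Keep only the first (best-ranked) hit seen for each databank.
sub best_by_db {
    my @names = @_;
    my ( %done, @best );
    for my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;    # already have a hit for this databank
        push @best, $name;
    }
    return @best;
}

# Invented hit names, in report order.
my @hits = ( 'sp|X00001|HIT1', 'sp|X00002|HIT2', 'tr|X00003|HIT3', 'pdb|1XYZ|A' );
my @best = best_by_db(@hits);
print "$_\n" for @best;
```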

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new( -file => $ARGV[1], -format => 'fasta' );
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;    # only the first hit of each result
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->frac_identical() >= $level ) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[0] );

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ( $hsp->frac_identical() >= $level ) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ( '-format' => 'blast',
                                           '-file'   => $ARGV[1] );

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_" . $i;
            my $seq   = fill_gaps( $hsp->query_string, $start, $aln->length );
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for ( my $i = 0; $i < $start - 1; $i++ ) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for ( my $i = $start + length($seq); $i <= $length; $i++ ) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
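The padding done by fill_gaps can be checked in isolation. The variant below is a sketch that produces the same padding with the repetition operator x instead of two loops; the guard against a negative right-hand pad is an addition of this sketch, not part of the solution above.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# and the result is $length columns long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the end: no right padding
    return ( '-' x $left ) . $seq . ( '-' x $right );
}

my $row = fill_gaps( 'ACGT', 3, 10 );
print "$row\n";
```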

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new( -file => $seqfile, -format => 'fasta' );
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new( database => $db,
                                                             program  => 'blastp' );

        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ( $query->all_SeqFeatures() ) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, '-format' => 'swiss' );
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program'  => 'blastp',
                                                         'database' => 'swissprot' );

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite( -file => $ARGV[0] );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->score, $hsp->length,
                        $hsp->match, $hsp->positive ), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite( -fh => \*STDIN );
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->score, $hsp->length,
                        $hsp->match, $hsp->positive ), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ( '-file' => $ARGV[0] );
    my @previous_hitarray = ();

    for my $iteration ( 1 .. $blast_report->number_of_iterations() ) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if ( grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        # remember only the hits of this iteration for the next comparison
        @previous_hitarray = ( @$oldhitarray_ref, @$hitarray_ref );
    }
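The comparison between two consecutive iterations amounts to a set difference, which can be sketched without a report file. diff_hits is a hypothetical helper and the hit names per round are invented; the sketch assumes that the full hit list of each iteration is available as an array.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive iterations (two array references).
sub diff_hits {
    my ($previous, $current) = @_;
    my %in_prev = map { $_ => 1 } @$previous;
    my %in_curr = map { $_ => 1 } @$current;
    my @new  = grep { !$in_prev{$_} } @$current;     # not seen before
    my @gone = grep { !$in_curr{$_} } @$previous;    # no longer present
    return ( \@new, \@gone );
}

# Invented hit names for three iterations.
my @rounds = ( [qw(A B C)], [qw(A C D)], [qw(C D E)] );

my @previous;
for my $i ( 0 .. $#rounds ) {
    my ($new, $gone) = diff_hits( \@previous, $rounds[$i] );
    printf "iteration %d: new=[%s] disappeared=[%s]\n", $i + 1, "@$new", "@$gone";
    @previous = @{ $rounds[$i] };
}
```

Using hashes for membership avoids the pattern match of the grep in the solution, which would also accept a hit whose name merely contains another hit's name.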

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh( -format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastp' );
    $report  = $factory->bl2seq( $query1, $query2 );

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;   # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                        -id => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot => 'swiss',
                       sptrnrdb => 'swiss',
                       gb => 'genbank',
                       embl => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # an entry is given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ''; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;


Page 43: Bio Perl

Chapter 4 Analysis

Figure 4.2. BPLite Classes diagram

4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                   -format => 'fasta' );

my $query = <$query_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while ( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        # nextHSP returns one HSP per call, so loop until it is exhausted
        while ( my $hsp = $hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the specific iterations:

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and at each iteration display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25
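The comparison asked for in this exercise comes down to two set differences between the hit names of consecutive iterations. The following is only a minimal sketch in plain Perl, independent of bioperl: diff_hits is a hypothetical helper and the hit names are made up for illustration. In a real script the two lists would come from the parsed PSI-blast rounds.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive iterations.
# Returns two array references: names found only in the current iteration
# (new hits) and names found only in the previous one (disappeared hits).
sub diff_hits {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} } @$previous;
    return (\@new, \@disappeared);
}

# Hypothetical hit names, one list per iteration.
my @iterations = (
    [qw(HIT_A HIT_B HIT_C)],
    [qw(HIT_A HIT_C HIT_D)],
);

my @previous = ();
my $round = 0;
foreach my $hits (@iterations) {
    $round++;
    my ($new, $gone) = diff_hits(\@previous, $hits);
    print "iteration $round\n";
    print "  NEW hit: $_\n"         foreach @$new;
    print "  DISAPPEARED hit: $_\n" foreach @$gone;
    @previous = @$hits;
}
```

Using hashes for the membership test keeps each comparison linear in the number of hits, which matters little here but avoids the nested grep over both lists.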

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq => $hsp->qs,
        -id => $query1->display_id,
        -start => 1,
        -end => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq => $hsp->ss,
        -id => $report->sbjctName,
        -start => 1,
        -end => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast Internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]):

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, " - ", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called, and at which class hierarchy level?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run => \%runParam,
                                    -parse => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method => 'remote',
              -prog => 'blastp',
              -database => 'swissprot',
              -seqs => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method => 'remote',
              -prog => 'blastp',
              -database => 'swissprot',
              -seqs => \@seqs);     # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " $a{a} $a{b}\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
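The three uses of references listed above (a hash passed as a parameter, a function passed as a parameter, an array used as an output parameter) can be combined in one small self-contained script. This is only a sketch: run_filter, is_long and the -values key are hypothetical names chosen for the illustration, not bioperl parameters.

```perl
use strict;
use warnings;

# Receives a hash reference (input parameters), a code reference (filter)
# and an array reference that is filled in place (output parameter).
sub run_filter {
    my ($param, $filter, $kept) = @_;
    foreach my $value (@{ $param->{-values} }) {
        push @$kept, $value if $filter->($value);
    }
    return scalar @$kept;
}

# A filter function, passed below as \&is_long.
sub is_long { return length($_[0]) > 3 }

my %runParam = ( -values => [ 'ACGT', 'AC', 'ACGTT' ] );   # anonymous array as a value
my @results;                                               # filled by run_filter

my $count = run_filter(\%runParam, \&is_long, \@results);
print "kept $count values: @results\n";

# ref() reports the kind of thing each reference points to.
print ref(\%runParam), " ", ref(\&is_long), " ", ref(\@results), "\n";
```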

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                 -format => 'embl');

typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
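To make the typeglob reference idiom concrete, here is a minimal sketch of a function that writes to whatever filehandle it receives. The name write_records is hypothetical. The in-memory file is only used so that the example does not create anything on disk.

```perl
use strict;
use warnings;

# Write one record per line to the filehandle supplied by the caller.
# $fh may be a typeglob reference (\*STDOUT) or a lexical filehandle.
sub write_records {
    my ($fh, @records) = @_;
    my $n = 0;
    foreach my $record (@records) {
        print $fh "$record\n";
        $n++;
    }
    return $n;
}

# A bareword filehandle is passed as a reference to its typeglob.
write_records(\*STDOUT, 'first record', 'second record');

# A lexical filehandle (here opened on an in-memory string) is passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_records($mem, 'third record');
close($mem);
print "buffer holds: $buffer";
```

Lexical filehandles (open(my $fh, ...)) avoid the typeglob reference altogether, which is why they are usually preferred in newer code.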

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if ( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
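Both ideas, catching an exception with eval and using an exception to encode an abstract method, can be tried without bioperl. The sketch below uses plain die instead of the bioperl throw method. Shape, Square and try_area are hypothetical names invented for the illustration.

```perl
use strict;
use warnings;

package Shape;    # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
# "Abstract" method: raising an exception obliges subclasses to override it.
sub area {
    my ($self) = @_;
    die ref($self) . " did not provide an area method\n";
}

package Square;   # hypothetical implementing class
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package main;

# Call area() inside eval and report either the result or the exception.
sub try_area {
    my ($shape) = @_;
    my $area = eval { $shape->area() };   # note the ; that ends the eval statement
    return defined $area ? "area: $area" : "caught: $@";
}

print try_area(Square->new(side => 3)), "\n";
print try_area(Shape->new());
```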

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) { $format = $opt_f; }
else { $format = "Genbank"; }
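A self-contained variant of the first example, written as a function so that it can be tried with different argument lists. This is a sketch: parse_options and the keys of the returned hash are hypothetical names, and the defaults simply mirror those used above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a list of command line arguments and return the chosen settings.
# In the specification, a letter followed by ':' takes a value.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;              # getopts works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        rest    => [ @ARGV ],         # arguments left after the options
    };
}

my $settings = parse_options('-o', 'gcg', 'seqs.sp');
print "in: $settings->{inform}, out: $settings->{outform}, ",
      "files: @{ $settings->{rest} }\n";
```

Passing a hash reference as the second argument of getopts keeps the options in one place and works under use strict, whereas the $opt_f style relies on global variables.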

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
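These class operations can be tried on two toy classes defined in a single file. The sketch below is independent of bioperl. Feature and Exon are hypothetical classes, not the bioperl ones.

```perl
use strict;
use warnings;

package Feature;                  # hypothetical base class
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub describe {
    my ($self) = @_;
    return ref($self) . " $self->{start}-$self->{end}";
}

package Exon;                     # hypothetical subclass
our @ISA = qw(Feature);           # inheritance is declared through @ISA

package main;

my $exon = Exon->new(start => 10, end => 42);    # new is inherited from Feature
print "class of exon: ", ref($exon), "\n";
print "the object is a Feature\n" if $exon->isa('Feature');   # test on an object
print "the class is a Feature\n"  if Exon->isa('Feature');    # test on a class
print $exon->describe(), "\n";
```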

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which uses a BEGIN block to define the default coordinate policy:

BEGIN {
    $coord_policy = Bio::Location::WidestCoordPolicy->new();
}

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
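The effect of a BEGIN block on execution order can be observed directly. In this sketch, Policy is a hypothetical package standing in for a module with a class-wide default, and @order only records which statement ran first.

```perl
use strict;
use warnings;

package Policy;                       # hypothetical module with a class-wide default

my $default_policy;
BEGIN { $default_policy = 'widest' }  # executed at compile time, as soon as it is parsed

sub default_policy { return $default_policy }

package main;

our @order;
push @order, 'run-time statement';    # executed when the program runs
BEGIN { push @order, 'BEGIN block' }  # executed first, although it is written second

print "default policy: ", Policy::default_policy(), "\n";
print join(' -> ', @order), "\n";
```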

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
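The three Exporter variables can be seen at work with a small module defined inline. This is a sketch under assumptions: My::SeqUtil, revcom, gc_count and the stats tag are hypothetical names. Because the module lives in the same file, the import that a use statement would normally trigger is called explicitly in a BEGIN block.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtil;              # hypothetical module, defined inline
    use Exporter;
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(revcom);                    # exported by default
    our @EXPORT_OK   = qw(gc_count);                  # imported on demand
    our %EXPORT_TAGS = ( stats => [ qw(gc_count) ] ); # a tag for a set of symbols

    # Reverse complement of a DNA string.
    sub revcom {
        my ($s) = @_;
        (my $r = reverse $s) =~ tr/ACGTacgt/TGCAtgca/;
        return $r;
    }

    # Number of G and C letters in a DNA string.
    sub gc_count {
        my ($s) = @_;
        return ($s =~ tr/GCgc//);
    }
}

# Equivalent to "use My::SeqUtil qw(:DEFAULT :stats);" for a module stored in its own file.
BEGIN { My::SeqUtil->import(qw(:DEFAULT :stats)) }

print revcom('AACG'), "\n";
print gc_count('AACG'), "\n";
```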

6.3.2. Compiler instructions

• use strict;: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s,$class,$self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;          # make a tied filehandle
$sequence = <$fh>;       # read a sequence object (calls READLINE)
print $fh $sequence;     # write a sequence object (calls PRINT)
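The same mechanism can be reproduced with a toy class, which makes the calls to READLINE and PRINT easy to follow. In this sketch, LineStore is a hypothetical class that hands out strings instead of sequence objects. Only the tie interface (TIEHANDLE, READLINE, PRINT) is standard Perl.

```perl
use strict;
use warnings;

package LineStore;                # hypothetical class, stands in for a sequence stream

# Called by tie: build the object behind the filehandle.
sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [ @items ], out => [] }, $class;
}

# Called by <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @rest = @{ $self->{in} };
    $self->{in} = [];
    return @rest;
}

# Called by print $fh ...: store what is written.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;
use Symbol;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'LineStore', 'seq1', 'seq2', 'seq3';

my $first  = <$fh>;               # calls READLINE in scalar context
my @others = <$fh>;               # calls READLINE in list context
print $fh $first;                 # calls PRINT
print "read $first, then @others; stored @{ $obj->{out} }\n";
```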

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile ) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0;   # false
    }
    return 1;       # true
}

sub usage {

    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat   -- infile format (default 'swiss')\n";
    print "  -o outformat  -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h            -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function (exercise)
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split ("\n", $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
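The column arithmetic of this solution can be checked on plain strings, without bioperl. In the sketch below, trim_columns is a hypothetical helper and the aligned strings are made up. It returns the same two values that the solution passes to slice.

```perl
use strict;
use warnings;

# Given aligned sequences of equal length, return the 1-based first and last
# columns to keep so that no sequence starts or ends with a gap.
sub trim_columns {
    my ($gap_char, @seqs) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@seqs) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;   # run of gaps at the start
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;   # run of gaps at the end
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($seqs[0]) - $endgaps);
}

my @aligned = ('--ACGTAC-', 'TTACG-ACG', '-TACGTA--');
my ($first, $last) = trim_columns('-', @aligned);
print "keep columns $first to $last\n";
print substr($_, $first - 1, $last - $first + 1), "\n" foreach @aligned;
```

Gaps inside the kept block are left untouched: only the leading and trailing columns are removed.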

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (alternative to a), use one or the other)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program' => 'blastp',
    'database' => 'swissprot',
    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast"
                                                    );
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while ( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {

    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall($query);
# ref() on an object returns the name of its class
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# length threshold: hits at least this long are displayed (default 200)
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

# -1 means "no upper bound on the end position"
my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# the databank name is the word that precedes the first '|' of the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

# databanks for which a hit has already been displayed
my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
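The "first hit seen per databank" logic of this solution does not depend on bioperl, so it can be tried on its own. The following is a minimal sketch in plain Perl: the hit names are made up, and it assumes (as the solution does) that hits arrive sorted from best to worst and that the databank is the word before the first `|`.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|P00001|A_X".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hypothetical hit names, assumed to be sorted from best to worst.
my @hits = ('sp|P00001|A_X', 'tr|Q00002|B_X', 'sp|P00003|C_X', 'tr|Q00004|D_X');

my (%done, @best);
for my $hit (@hits) {
    my $db = dbname($hit);
    # skip names without a databank prefix and databanks already seen
    next if !defined $db || $done{$db}++;
    push @best, $hit;
}

print "$_\n" for @best;
```

The `%done` hash plays the same role as in the solution: the first hit met for a databank is kept, later ones are skipped.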

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
# several queries are passed as a reference to an array of sequences
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        # only the first hit of each result is displayed
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
# minimal fraction of identical positions
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
# minimal fraction of identical positions
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);
my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
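The padding done by fill_gaps can be checked without bioperl. The sketch below is an equivalent formulation using the `x` repetition operator instead of loops; the input values are made up, and it assumes, like the solution, that `$start` is a 1-based column on the reference.

```perl
use strict;
use warnings;

# Pad a fragment with '-' so that it spans $length columns;
# $start is the 1-based column where the fragment begins.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left_count  = $start - 1;
    my $right_count = $length - $left_count - length($seq);
    $right_count = 0 if $right_count < 0;   # fragment reaches the end
    return ('-' x $left_count) . $seq . ('-' x $right_count);
}

my $gapped = fill_gaps('ACGT', 3, 10);
print "$gapped\n";
```

Every padded sequence then has the same length as the contig, which is what lets the sequences be stacked in a single alignment.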

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');
    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        # only the first subject of each databank is recorded
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');
my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

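Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# names of the hits found at the previous iteration
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    # hits that were not present at the previous iteration
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }

    # hits of the previous iteration that are no longer present
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    # remember only the hits of the current iteration, so that a hit
    # is reported as disappeared once and not at every later iteration
    @previous_hitarray = ();
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}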

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;   # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);
$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # exon coordinates become relative to the entry subsequence
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');
    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # databanks known to golden, with the format of their entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # an entry is given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden command
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # close the pipe so that the exit status of golden is available in $?
    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value when it is loaded
1;


4.1.4. PSI-BLAST (Position Specific Iterative Blast)

See documentation [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html] or tutorial [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html] at NCBI.

Example 4.5. Running PSI-blast

The following example runs a PSI-blast:

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $query_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                   -format => 'fasta' );
my $query = <$query_in>;

my $factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
# number of iterations
$factory->j(2);
my $blast_report = $factory->blastpgp($query);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);

    while( my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        foreach my $hsp ($hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}

You can also load a report from a file or from standard input:

my $blast_report = Bio::Tools::BPpsilite->new(-fh => \*FH);

Example 4.6. PSI-Blast SearchIO class

There is also a SearchIO class for PSI-blast reports, but it does not give access to the individual iterations:

use Bio::SearchIO;

my $in = Bio::SearchIO->new( -format => 'psiblast',
                             -file   => $ARGV[0] );

while ( my $result = $in->next_result() ) {
    foreach my $hit ( $result->hits ) {
        print "Hit: $hit\n";
    }
}

Exercise 4.20. Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5. bl2seq: Blast 2 sequences

Example 4.7. Running bl2seq

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

Exercise 4.21. Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format.

Solution A.26

4.1.6. Blast: internal class structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
}

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

    • containing an entry for each Genscan predicted CDS

    • with the exons of this CDS added as Features of the entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, but this is not efficient for large databases (indexing time and size of the indexes). All these databases are already indexed, very efficiently, by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );
print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you to use the modules [use] or to understand them in more depth [advanced].

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);   # Bio::Seq.pm objects

which is equivalent, with a named array variable, to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);     # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = {a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced value (or its class, see below).
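The three uses of references listed above (input structure, function, output parameter) can be combined in one small program. This is a sketch in plain Perl with made-up names: `run_search`, `my_filter` and the parameter keys are hypothetical and only imitate the bioperl calling style.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only scores above a threshold.
sub my_filter { my ($score) = @_; return $score > 50; }

# A function taking named parameters, several of them references.
sub run_search {
    my (%param) = @_;
    my $scores = $param{-scores};       # array reference (input)
    my $filter = $param{-filt_func};    # code reference
    my $kept   = $param{-save_array};   # array reference (output)
    push @$kept, grep { $filter->($_) } @$scores;
    return scalar @$kept;
}

my @scores = (12, 75, 60, 3);
my @kept;
my $n = run_search(-scores     => \@scores,
                   -filt_func  => \&my_filter,
                   -save_array => \@kept);

print "kept $n: @kept\n";
print ref(\@scores), " ", ref(\&my_filter), " ", ref({ a => 1 }), "\n";
```

The caller's `@kept` array is filled by the function because only a reference to it was passed, which is how an output parameter such as `-save_array` works.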

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a value

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
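To see the difference between a bareword filehandle and a handle stored in a variable, here is a minimal sketch in plain Perl. The `write_lines` function is hypothetical; the in-memory file is only there so that the example needs no file on disk.

```perl
use strict;
use warnings;

# Write each item on its own line to whatever filehandle is passed in.
sub write_lines {
    my ($fh, @items) = @_;
    print $fh "$_\n" for @items;
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_lines(\*STDOUT, 'to standard output');

# A lexical handle is an ordinary scalar; here it writes into a string.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_lines($out, 'seq1', 'seq2');
close($out);

# Read the first line back from the same string.
open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
my $first = <$in>;
close($in);
chomp $first;
print "first line read back: $first\n";
```

The function does not care which kind of handle it receives, which is why bioperl constructors can accept `\*STDOUT` through their `-fh` parameter.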

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
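Both ideas, catching an exception with eval and using die to encode an abstract method, can be tried without bioperl. The sketch below uses plain Perl; the `Shape`, `Square` and `Blob` classes are hypothetical and stand in for an interface class and two subclasses.

```perl
use strict;
use warnings;

package Shape;   # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class; }
# "Abstract" method: dies unless a subclass overrides it.
sub area { my ($self) = @_; die ref($self) . " did not implement area\n"; }

package Square;  # implements the interface
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;    # forgets to implement area
our @ISA = ('Shape');

package main;

my @report;
for my $shape (Square->new(side => 3), Blob->new()) {
    # eval returns the value of the block, or undef if it died
    my $area = eval { $shape->area() };
    if ($@) {
        chomp(my $msg = $@);
        push @report, "caught: $msg";
    } else {
        push @report, "area: $area";
    }
}
print "$_\n" for @report;
```

The program keeps running after the failed call: the message raised by die is available in `$@` once the eval block has ended.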

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
# a colon after a letter means that the option takes a value
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
# without a hash reference, getopts sets the $opt_x variables
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
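The first example can be run as is once a command line is supplied. The sketch below simulates one by setting `@ARGV` inside the program; the option values and the usage message are made up.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, so that the sketch runs without arguments.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra_argument');

my %opts;
# A colon after a letter means that the option takes a value.
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i format] [-o format] [-I file] [-O file]\n";

# Options that were not given fall back to a default value.
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "$infile ($inform) -> $outfile ($outform); remaining: @ARGV\n";
```

getopts removes the options it recognizes from `@ARGV`, so whatever is left afterwards (here one extra argument) can be treated as file names or identifiers.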

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult])

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
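Inheritance through `@ISA`, the `isa` test and `ref` can be seen together in a few lines of plain Perl. The `Feature` and `Exon` classes below are hypothetical and much simpler than the bioperl classes of the same flavour.

```perl
use strict;
use warnings;

package Feature;                 # hypothetical base class
sub new {
    my ($class, %args) = @_;
    my $self = { start => $args{-start}, end => $args{-end} };
    return bless $self, $class;
}
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1; }

package Exon;                    # hypothetical subclass
our @ISA = qw(Feature);          # inheritance
sub is_coding { return 1; }

package main;

# new is inherited from Feature, but the object is blessed into Exon
my $exon = Exon->new(-start => 10, -end => 19);

print "class of exon: ", ref($exon), "\n";
print "the object is a Feature\n" if $exon->isa('Feature');
print "the class Exon derives from Feature\n" if Exon->isa('Feature');
print "span: ", $exon->span, "\n";
```

Because the constructor blesses into the class it was called on, a subclass gets a working `new` without redefining it.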

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which uses a BEGIN block to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
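A short sketch of the execution order, using core Perl only: the @trace and $default_policy variables are hypothetical names for this illustration. The BEGIN block runs while the file is being compiled, hence before the statement that precedes it in the text.

```perl
use strict;
use warnings;

our @trace;             # records the order in which the code runs
our $default_policy;    # hypothetical module-wide default

push @trace, 'first runtime statement';

BEGIN {
    # Executed at compile time, before any ordinary statement of the file.
    $default_policy = 'widest';
    push @trace, 'BEGIN block';
}

push @trace, 'second runtime statement';

print "$_\n" for @trace;
print "default policy: $default_policy\n";
```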

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj; here, $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
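The three Exporter variables can be seen at work in a single file. The sketch below assumes a hypothetical My::SeqUtil module, written inline in a BEGIN block so that the example is self-contained; for a module stored in its own file, the explicit import call would simply be use My::SeqUtil qw(:all);.

```perl
use strict;
use warnings;

BEGIN {
    package My::SeqUtil;                        # hypothetical module
    use Exporter 'import';
    our @EXPORT      = qw(revcomp);                          # exported by default
    our @EXPORT_OK   = qw(gc_count $Default_Alphabet);       # imported on demand
    our %EXPORT_TAGS = ( all => [ @EXPORT, @EXPORT_OK ] );   # a named set of symbols
    our $Default_Alphabet = 'dna';

    # Reverse complement of a DNA string.
    sub revcomp {
        my $s = reverse $_[0];
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }

    # Number of G and C letters in a string.
    sub gc_count {
        my $s = shift;
        return scalar( $s =~ tr/GCgc// );
    }
}

# Import every symbol tagged "all" into the current package.
BEGIN { My::SeqUtil->import(':all') }

print revcomp('AACG'), "\n";
print gc_count('AACG'), "\n";
print "$Default_Alphabet\n";
```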

6.3.2 Compiler instructions

• use strict; : no symbolic references, no access to undeclared non-local variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type); : a pragma to pre-declare global variables.
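A small sketch of what these two pragmas change in practice. The $dna and $name variables are hypothetical; the symbolic reference is attempted inside eval so that the script can report the error instead of stopping.

```perl
use strict;
use warnings;
use vars qw(%valid_type);    # pre-declared global (package) variable

%valid_type = map { $_ => 1 } qw(dna rna protein);

our $dna  = 'ACGT';
my  $name = 'dna';

# Under "use strict", a string cannot be used as a variable name.
my $value = eval { ${$name} };
my $error = $@;
print "symbolic reference refused\n" if $error =~ /strict refs/;

# The restriction can be lifted locally when it is really wanted.
my $allowed = do { no strict 'refs'; ${"main::$name"} };
print "with no strict 'refs': $allowed\n";
```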

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self  = shift;
        my $class = ref($self) || $self;
        my $s     = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh       = $obj->fh;     # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
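The same mechanism can be tried on a class that has nothing to do with sequences. In this sketch, RecordStream is a hypothetical class (not part of bioperl) whose fh method follows the pattern shown above; reading from the tied handle calls READLINE and printing to it calls PRINT.

```perl
use strict;
use warnings;
use Symbol;

package RecordStream;    # hypothetical class, not part of bioperl

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}

# Build a filehandle tied to this object.
sub fh {
    my $self = shift;
    my $s    = Symbol::gensym;
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie: the object itself serves as the tied object.
sub TIEHANDLE { my ($class, $self) = @_; return $self }

# <$fh> in scalar context returns one record, in list context all of them.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    return splice @{ $self->{in} };
}

# print {$fh} ... stores what is printed.
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;

my $stream = RecordStream->new( 'seq1', 'seq2', 'seq3' );
my $fh     = $stream->fh;

my $first = <$fh>;            # calls READLINE in scalar context
print {$fh} uc($first);       # calls PRINT
my @rest = <$fh>;             # calls READLINE in list context

print "first: $first, rest: @rest, written: @{ $stream->{out} }\n";
```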


Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() so that an absent option does not warn under -w
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {

        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix from a hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;    # databanks for which a hit has already been displayed

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
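The "first hit seen per databank" logic of this solution can be checked without a Blast report. The sketch below works on plain hit names; the names in @hits are made up for the illustration, and best_hit_by_db is a hypothetical helper, not a bioperl method.

```perl
use strict;
use warnings;

# Databank prefix of a hit name such as "sp|P37610|TAUD_ECOLI".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Keep the first hit seen for each databank; since hits are listed by
# decreasing significance, the first one is the best one.
sub best_hit_by_db {
    my @names = @_;
    my ( %done, @best );
    foreach my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, in the order a report would list them.
my @hits = ( 'sp|P37610|TAUD_ECOLI', 'sp|Q00001|EXAMPLE_A',
             'pir|S00002|EXAMPLE_B', 'sp|Q00003|EXAMPLE_C' );

print "$_\n" for best_hit_by_db(@hits);
```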

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
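The padding step can be tested on its own. This is a sketch of an equivalent helper written with the x repetition operator instead of the two loops; it assumes a 1-based start position and '-' as gap character, as in the solution above.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;                          # gaps before the sequence
    my $right = $length - $left - length($seq);      # gaps after the sequence
    $right = 0 if $right < 0;
    return ( '-' x $left ) . $seq . ( '-' x $right );
}

print fill_gaps( 'ACGT', 3, 10 ), "\n";
print fill_gaps( 'ACGT', 1, 4 ),  "\n";
```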

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
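The comparison between two consecutive iterations amounts to two set differences, which can be sketched with hashes in core Perl. The compare_rounds helper is hypothetical and the hit lists are made up for the illustration; in the solution above they would come from the newhits and oldhits methods.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive iterations.
# Takes two array references and returns two array references:
# the hits that are new, and the hits that have disappeared.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return ( \@new, \@disappeared );
}

# Made-up hit lists for two iterations.
my @round1 = qw(TAUD_ECOLI EXAMPLE_A EXAMPLE_B);
my @round2 = qw(TAUD_ECOLI EXAMPLE_B EXAMPLE_C);

my ($new, $gone) = compare_rounds( \@round1, \@round2 );
print "NEW hit name: $_\n"         for @$new;
print "DISAPPEARED hit name: $_\n" for @$gone;
```

Using a hash lookup avoids scanning the previous list with a regular expression for every hit.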


Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end   - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if ( ! $nocds ) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        # an entry is given as database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module file must end with a true value


Chapter 4 Analysis

    print "Hit: $hit\n";

Exercise 4.20 Parse a PSI-blast report

Load a PSI-blast report (example [data/psiblast.out]) and, at each iteration, display:

• the hits that were not present at the previous iteration

• the hits that have disappeared

Solution A.25

4.1.5 bl2seq: Blast 2 sequences

Example 4.7 Running bl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences, and display this alignment in Clustalw format.

Solution A.26

4.1.6 Blast: internal classes structure

The following diagram shows the internal classes structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]):

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    }

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, " - ", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4 Genscan classes structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

Chapter 5 Databases

51 Database classes

Example 51 Database class use

use BioIndexSwissprotuse BioSeqIO

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)

foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq

Example 52 Database Index creation

(data a SwissProt file to index [datasmall_swissdat] )

use BioIndexSwissprot

my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name

-write_flag =gt 1)$inx-gtmake_index(ARGV)

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • with the exons of this CDS stored as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27


Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, but this is not efficient for large databases (indexing time and index size). All these databases are already indexed very efficiently by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using bioperl modules (Section 6.2) or for a more advanced understanding of them (Section 6.3).

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

• reference to an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to the following, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

• references to anonymous data structures:

    # (reminder) a hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference to an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference to an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
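The reference forms listed above can be tried without bioperl. The following is a minimal sketch in plain Perl: run_search and keep_long are hypothetical names standing in for a bioperl constructor and a filter function, and the sequences are made-up strings rather than Bio::Seq objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bioperl constructor: it receives a hash
# reference, a code reference and an array reference used for output.
sub run_search {
    my (%args) = @_;
    my $params = $args{-run};          # reference to a hash
    my $filter = $args{-filt_func};    # reference to a function
    my $saved  = $args{-save_array};   # reference to an array (output)
    foreach my $seq (@{ $params->{-seqs} }) {
        push @$saved, $seq if $filter->($seq);
    }
    return scalar @$saved;
}

# Hypothetical filter: keep sequences of at least 5 residues.
sub keep_long { my ($seq) = @_; return length($seq) >= 5; }

my %runParam = ( -prog => 'blastp',
                 -seqs => [ 'MASTEG', 'MK', 'LMPITRAF' ] );  # anonymous array
my @kept;
my $count = run_search( -run        => \%runParam,
                        -filt_func  => \&keep_long,
                        -save_array => \@kept );

print "kept $count sequences: @kept\n";
print ref(\%runParam), " ", ref(\&keep_long), " ", ref(\@kept), "\n";
```

The caller's @kept array is filled through the reference, which is what "output parameter" means in the list above.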

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. A filehandle is associated with a stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                           # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f');
    print $out $sequence;                        # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

  Typeglobs live in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
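As an illustration of passing filehandles around, here is a small sketch in plain Perl. The write_record helper is hypothetical; the in-memory file is only used so that the result can be checked without touching the disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA-like record to whatever
# filehandle it is given.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword filehandle cannot be passed directly; pass a reference
# to its typeglob instead.
write_record(\*STDOUT, 'seq1', 'ttaagttcca');

# The same sub works with a lexical filehandle, here opened on an
# in-memory string.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($out, 'seq2', 'gagtgttcat');
close($out);

# Read the record back line by line.
open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
my @lines = <$in>;
close($in);
chomp @lines;
print "read back: $lines[0] / $lines[1]\n";
```

Lexical filehandles (my $out) avoid the typeglob issue altogether, but the \*STDOUT form is still needed for the predefined handles.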

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl uses perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
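The throw/eval pattern and the "abstract method" idiom can be sketched together in plain Perl. The classes My::SeqI, My::Seq and My::Broken below are hypothetical and only mimic the bioperl design; the throw method here is a simplified stand-in without a stack trace.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() is "abstract" and throws unless
# a subclass overrides it.
package My::SeqI;
sub new   { my ($class, %args) = @_; return bless {%args}, $class; }
sub throw { my ($self, $msg) = @_; die "EXCEPTION\nMSG: $msg\n"; }
sub seq   { my ($self) = @_; $self->throw(ref($self) . " did not implement seq"); }

# Concrete subclass that provides the method.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{-seq}; }

# Subclass that forgets to implement seq().
package My::Broken;
our @ISA = ('My::SeqI');

package main;

my @results;
foreach my $obj (My::Seq->new(-seq => 'MASTEG'), My::Broken->new()) {
    my $value = eval { $obj->seq() };   # eval returns the block's value
    if ($@) {
        # keep only the MSG line of the exception text
        push @results, "caught: " . (split /\n/, $@)[1];
    } else {
        push @results, "ok: $value";
    }
}
print "$_\n" for @results;
```

The call on the My::Broken object falls through to the interface class, which raises the exception; the eval block turns it into an ordinary value in $@.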

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
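To see what getopts does with the argument list, the first example can be run against a simulated command line. This is a sketch: the contents of @ARGV are an assumption made for the illustration, and the extra 'h' flag is added only to show an option without a value.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (an assumption made for the example):
#   script.pl -i embl -O seqs.fasta extra_arg
@ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra_arg');

my %opts = ();
# A letter followed by ':' takes a value; 'h' is a plain flag.
getopts('hi:o:I:O:', \%opts) or die "bad options\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "in: $infile ($inform), out: $outfile ($outform)\n";
print "remaining arguments: @ARGV\n";
```

Options that were recognized are removed from @ARGV, so whatever remains can be treated as file names or identifiers.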

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
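The three operations above (inheritance through @ISA, the isa test, and ref to find the class) can be tried on a toy hierarchy. My::Seq and My::RichSeq are hypothetical classes that only mirror the relationship between Bio::Seq and Bio::Seq::RichSeq.

```perl
use strict;
use warnings;

# Hypothetical base class.
package My::Seq;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub id  { return $_[0]->{-id}; }

# Hypothetical subclass: inherits new() and id() through @ISA.
package My::RichSeq;
our @ISA = ('My::Seq');
sub division { return $_[0]->{-division}; }

package main;

my $seq = My::RichSeq->new(-id => 'TAUD_ECOLI', -division => 'PRO');

print "class of seq: ", ref($seq), "\n";
print "object is-a My::Seq\n"        if $seq->isa('My::Seq');
print "class is-a My::Seq\n"         if My::RichSeq->isa('My::Seq');
print "My::Seq is not a RichSeq\n"   unless My::Seq->isa('My::RichSeq');
print "id (inherited method): ", $seq->id, "\n";
```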

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy this way:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
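The effect of a BEGIN block on execution order can be observed with a few lines of plain Perl. In this sketch the $policy variable and its value are hypothetical; they only stand in for a default such as the coordinate policy above.

```perl
use strict;
use warnings;

our @order;
our $policy;

push @order, 'runtime statement';

# Runs as soon as the block has been compiled, before any ordinary
# statement of the file is executed, even though it appears later.
BEGIN {
    $policy = 'widest';       # hypothetical default, set at load time
    push @order, 'BEGIN block';
}

print "order: ", join(' -> ', @order), "\n";
print "policy: $policy\n";
```

Because the block runs at compile time, the default is already in place when the first method of the module is called.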

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
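The three Exporter variables can be seen at work in a single file. This is a sketch only: My::Tools, gc_count, revcom and $Default are hypothetical names, and the module is defined inline instead of in its own My/Tools.pm file.

```perl
use strict;
use warnings;

# Hypothetical module defined inline; in practice it would live in
# its own file and be loaded with "use My::Tools ...".
BEGIN {
    package My::Tools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('gc_count');               # exported by default
    our @EXPORT_OK   = ('revcom', '$Default');     # imported on demand
    our %EXPORT_TAGS = ( obj => ['$Default'] );    # named set of symbols
    our $Default     = 'acgt';

    # number of G and C characters in a string
    sub gc_count { my ($s) = @_; return ($s =~ tr/gcGC//); }

    # reverse complement of a DNA string
    sub revcom {
        my ($s) = @_;
        $s = reverse $s;
        $s =~ tr/acgtACGT/tgcaTGCA/;
        return $s;
    }
}

# Equivalent to: use My::Tools qw(:DEFAULT revcom :obj);
BEGIN { My::Tools->import(qw(:DEFAULT revcom :obj)); }

print "GC in ", $Default, ": ", gc_count($Default), "\n";
print "revcom: ", revcom('aacg'), "\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is listed twice.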

6.3.2. Compiler instructions

• use strict; : no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type); : a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
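A tied filehandle can be built with core Perl only, which makes the mechanism easier to follow than in the full Bio::SeqIO code. The sketch below uses a hypothetical My::Stream class that serves items from a list and records what is printed; it ties the glob with tie *$s, the form described in the perltie documentation, rather than the exact statement used by bioperl.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical stream class: reads items from a list and collects
# whatever is printed to it.
package My::Stream;

sub new {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}

# Same idea as the fh method shown above: tie a fresh glob to the class.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, ref($self), $self;
    return $s;
}

# Called by tie; the object itself handles the filehandle operations.
sub TIEHANDLE { my ($class, $self) = @_; return $self; }

# Called for <$fh>: one item in scalar context, all items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    my @all = @{ $self->{items} };
    $self->{items} = [];
    return @all;
}

# Called for print $fh LIST.
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $stream = My::Stream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT

print "first: $first, rest: @rest\n";
print "written: @{ $stream->{written} }\n";
```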

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= "." . $outform;
    }

    my $in = Bio::SeqIO->newFh( -file   => $infile,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh( -file   => $infile,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lowercasing, to avoid a warning under -w
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {

        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE.\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use these lines instead of the ones in a)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                        '-prog'     => 'blastp',
                        '-data'     => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
        foreach my $rid ( @rids ) {

            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank name: the word before the first '|' of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;    # databanks already displayed

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                        -seq   => $seq->seq,
                        -id    => $seq->display_id,
                        -start => 1,
                        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                        -seq   => $seq,
                        -id    => $name,
                        -start => $start,
                        -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(
                            database => $db,
                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'  => 'blastp',
                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-BLAST report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in the previous iterations

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh( -file   => $ARGV[0],
                                       -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh( -file   => $ARGV[1],
                                       -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                        -seq   => $hsp->qs,
                        -id    => $query1->display_id,
                        -start => 1,
                        -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
                        -seq   => $hsp->ss,
                        -id    => $report->sbjctName,
                        -start => 1,
                        -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

    # date of construction
    my $date = gmtime;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new( -fh     => \*STDIN,
                                 -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    # print $banque $seq;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if ( ! $nocds ) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new( -primary  => 'CDS',
                                                        -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new( -seq     => $seqstr,
                                           -id      => $ids[1],
                                           -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local database name => sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # an entry is given as db:id
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ''; }

    if (! $id) { $self->throw("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db) );

    my $seq = $in->next_seq();

    # closing the pipe sets $? to the exit status of golden
    $in->DESTROY();
    if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;


Chapter 4 Analysis

    -start => 1,
    -end   => $query1->length
);
my $sbjctSeq = Bio::LocatableSeq->new(
    -seq   => $hsp->ss,
    -id    => $report->sbjctName,
    -start => 1,
    -end   => $query2->length
);

$aln->add_seq($querySeq);
$aln->add_seq($sbjctSeq);
print $out $aln;

Exercise 4.21 Build a Bio::SimpleAlign object

Build a Bio::SimpleAlign object from the subject and query sequences and display this alignment in Clustalw format. Solution A.26

4.1.6 Blast Internal classes structure

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3 Blast internal classes diagram

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ")\n", $prot->seq, "\n\n";
}

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ")\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding: ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1 );
$inx->make_index(@ARGV);

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:
    • containing an entry for each Genscan predicted CDS
    • with the exons of this CDS as Features in the current entry
• (b) index this Embl formatted database
• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                             -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may help you to use the modules (Section 6.2) or to understand them in more depth (Section 6.3).

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter );

• for an output parameter:

%blastParam = ( -run        => \%runParam,
                -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ] );    # Bio::Seq.pm objects

which is equivalent to

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs );      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) hash
%a = (a => 1, b => 2);
print "hash: ", ref(\%a), " $a{a} $a{b}\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
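The reference forms listed above can be tried without bioperl. The following is a minimal sketch using core Perl only; the names %run_param, my_filter and apply_filter are illustrative and are not part of the course or of bioperl.

```perl
use strict;
use warnings;

# Hypothetical filter function, passed around by reference (\&my_filter).
sub my_filter { my ($n) = @_; return $n > 1 }

# Hypothetical helper taking a hash reference and a code reference.
sub apply_filter {
    my ($params, $filter) = @_;
    return grep { $filter->($_) } @{ $params->{-items} };
}

my %run_param = ( -items => [ 1, 2, 3 ] );     # anonymous array inside a hash
my @kept = apply_filter(\%run_param, \&my_filter);

my $ra = { a => 1, b => 2 };                   # reference on an anonymous hash
my $rl = [ 1, 2 ];                             # reference on an anonymous array

print "kept: @kept\n";
print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";
print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";
```

Passing \%run_param rather than %run_param keeps the hash in one piece; without the backslash it would be flattened into the argument list.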

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;              # read a line
print OUTFILE $sequence;       # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh( -file => 'f' );       # make a tied filehandle
$sequence = <$in>;                             # read a sequence object
$out = Bio::SeqIO->newFh( -file => '>f' );
print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'embl' );

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
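Passing a typeglob reference can be sketched with core Perl alone. In this minimal example the helper write_record is hypothetical, and an in-memory file stands in for a real output file so that nothing is written to disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one record to whatever filehandle it is given.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} $text, "\n";
}

# An in-memory file stands in for a real output file.
my $buffer = '';
open(OUTFILE, '>', \$buffer) or die "cannot open in-memory file: $!";

write_record(\*OUTFILE, 'first record');   # bareword handle: pass \*OUTFILE
close(OUTFILE);

write_record(\*STDOUT, "buffer holds: $buffer");   # same call, standard output
```

The same function serves both handles because it only ever sees a reference, which is what the -fh parameter of newFh expects as well.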

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ( $@ ) {
    print "Caught exception\n";
} else {
    print "no exception\n";
}
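The eval/die pattern can be exercised without bioperl. The sketch below uses a hypothetical set_sequence function that dies on suspicious input, much as a bioperl method would do through throw; the message text is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical function that raises an exception on invalid input.
sub set_sequence {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n"
        unless $seq =~ /^[A-Za-z]+$/;
    return uc $seq;
}

my @status;
foreach my $candidate ('acgt', '==') {
    # block form of eval: note the ; after the closing brace
    my $result = eval { set_sequence($candidate) };
    if ($@) {
        push @status, "caught: $@";
    } else {
        push @status, "ok: $result\n";
    }
}
print @status;
```

The script keeps running after the second call fails, which is the point of wrapping a risky call in eval.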

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
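The abstract-method idiom can be reproduced in a few lines of core Perl. The classes My::ShapeI, My::Square and My::Blob below are hypothetical and only illustrate the mechanism; they have nothing to do with the bioperl class hierarchy.

```perl
use strict;
use warnings;

package My::ShapeI;          # hypothetical interface class
use Carp;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
# "abstract" method: dies unless a subclass overrides it
sub area { confess "My::ShapeI: implementing class did not provide area()" }

package My::Square;          # implements the interface
our @ISA = ('My::ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package My::Blob;            # forgets to implement area()
our @ISA = ('My::ShapeI');

package main;

my $area = My::Square->new(side => 3)->area;
my $ok   = eval { My::Blob->new->area; 1 };
print "square area: $area\n";
print "blob: ", ($ok ? "no exception" : "exception raised"), "\n";
```

The error only appears at run time, when the missing method is actually called; Perl has no compile-time check for this.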

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
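To see what getopts leaves behind, the following minimal sketch simulates a command line by filling @ARGV directly; the option values and the extra argument are invented for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line; a real script would use @ARGV as given.
local @ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra_argument');

my %opts;
# a letter followed by ':' takes a value; a letter alone is a boolean flag
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file]\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "in=$inform out=$outform file=$outfile rest=@ARGV\n";
```

Options that were not given fall back to their defaults, and getopts removes the parsed options from @ARGV so that only the remaining arguments are left.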

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

print "class of exon: ", ref($exon), "\n";
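The isa and ref tests work the same way on any Perl class hierarchy. The sketch below uses three hypothetical classes (My::Root, My::Seq, My::RichSeq) that only mimic the shape of the bioperl hierarchy.

```perl
use strict;
use warnings;

package My::Root;    sub new { return bless {}, shift }
package My::Seq;     our @ISA = ('My::Root');
package My::RichSeq; our @ISA = ('My::Seq');
package main;

my $seq = My::RichSeq->new;

my $class       = ref($seq);                       # class of the object
my $is_seq      = $seq->isa('My::Seq');            # test on an object
my $seq_is_rich = My::Seq->isa('My::RichSeq');     # test on a class name

print "class of seq: $class\n";
print "seq is a My::Seq\n"               if $is_seq;
print "My::Seq is not a My::RichSeq\n"   unless $seq_is_rich;
print "seq can call new\n"               if $seq->can('new');
```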

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses it to define the default coordinate policy:

BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = ( tag => [...] ): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
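The three Exporter variables can be seen at work in a single file. In this sketch the module My::Tools and its functions are hypothetical; because the module is defined inline rather than on disk, its export lists are set in a BEGIN block and import is called explicitly instead of through use.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the sketch is a single file.
package My::Tools;
use Exporter;
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Default);
BEGIN {
    @ISA         = ('Exporter');
    @EXPORT      = ('revcom');                   # exported by default
    @EXPORT_OK   = ('gc_count', '$Default');     # imported on demand
    %EXPORT_TAGS = ( obj => ['$Default'] );      # a named set of symbols
    $Default     = 'dna';
}

sub revcom {
    my ($s) = @_;
    $s = reverse $s;
    $s =~ tr/acgtACGT/tgcaTGCA/;
    return $s;
}

sub gc_count {
    my ($s) = @_;
    my $n = () = $s =~ /[gc]/gi;
    return $n;
}

package main;

# For a module on disk this would be: use My::Tools qw(:DEFAULT gc_count :obj);
BEGIN { My::Tools->import(qw(:DEFAULT gc_count :obj)) }

print revcom('aacg'), " ", gc_count('aacg'), " $Default\n";
```

Asking for a symbol that is in neither @EXPORT nor @EXPORT_OK makes import fail, which is how a module limits what callers can pull into their namespace.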

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;            # make a tied filehandle
$sequence = <$fh>;         # read a sequence object (calls READLINE)
print $fh $sequence;       # write a sequence object (calls PRINT)
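The tied-filehandle mechanism can be sketched without bioperl. Below, My::Stream is a hypothetical record stream standing in for Bio::SeqIO, and My::Stream::Handle is the class the handle is tied to; only TIEHANDLE, READLINE and PRINT are implemented, and READLINE handles scalar context only.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical record stream standing in for Bio::SeqIO.
package My::Stream;

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my $self = shift; return shift @{ $self->{in} } }
sub write_rec { my $self = shift; push @{ $self->{out} }, @_; return 1 }

# Same idea as the fh method above: tie a fresh glob to a handle class.
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();
    tie *$s, 'My::Stream::Handle', $self;
    return $s;
}

package My::Stream::Handle;

sub TIEHANDLE { my ($class, $stream) = @_; return bless { stream => $stream }, $class }
sub READLINE  { my $self = shift; return $self->{stream}->next_rec }
sub PRINT     { my $self = shift; return $self->{stream}->write_rec(@_) }

package main;

my $stream = My::Stream->new('seq1', 'seq2');
my $fh     = $stream->fh;

my $first = <$fh>;          # calls READLINE
print $fh $first;           # calls PRINT
print "read: $first, written: @{ $stream->{out} }\n";
```

From the caller's side $fh behaves like an ordinary filehandle, although every read and write is routed to methods of the stream object.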

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file

#!/local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output file name from the input file name
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in  = Bio::SeqIO->newFh( -file => $infile, -format => $inform );

my $out = Bio::SeqIO->newFh( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#!/local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::SeqIO->newFh( -fh => \*STDIN, -format => $inform );

my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                             -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#!/local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in  = Bio::SeqIO->newFh( -file => $infile, -format => $inform );

my $out = Bio::SeqIO->newFh( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#!/local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in  = Bio::SeqIO->newFh( -fh => \*STDIN, -format => $inform );

my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;   # false
    }

    return 1;       # true
}

sub usage {

    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "    -i informat  -- infile format (default 'swiss')\n";
    print "    -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "    -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4 More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function\n";

Solution A.5 Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

#!/local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
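The margin computation of this solution can be checked on plain strings. The following sketch assumes the alignment rows are available as equal-length strings; gap_margins is a hypothetical helper and the rows are invented data.

```perl
use strict;
use warnings;

# Returns the widest run of leading gaps and the widest run of trailing gaps
# found in any row.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

my @rows = ('--ACGTAC-', 'GGACGTACG', '-GACG-A--');   # invented alignment rows
my ($start, $end) = gap_margins('-', @rows);

# keep only the columns in which no row has a leading or trailing gap
my @trimmed = map { substr($_, $start, length($_) - $start - $end) } @rows;
print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the third row, are kept; only the ragged ends are cut.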

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use this instead of a)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog'     => 'blastp',
    '-data'     => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {

    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {

        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank prefix of a hit name (the part before the first |)
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
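The "first hit per databank" logic does not depend on bioperl and can be tried on a plain list of names. In the sketch below the hit names are invented and assumed to be sorted best hit first, as in a Blast report; this dbname variant also returns 'unknown' when the name has no databank prefix, which the solution above does not do.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|X00001|AAAA_TEST".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Invented hit names, best hit first.
my @hits = ('sp|X00001|AAAA_TEST', 'sp|X00002|BBBB_TEST',
            'pdb|1XXX|A', 'sp|X00003|CCCC_TEST', 'pdb|2YYY|B');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;       # a better hit was already seen for this bank
    push @best, $hit;
}
print "$_\n" for @best;
```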

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');

push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
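The padding done by fill_gaps can be tried on its own, without bioperl. The following is a minimal sketch that assumes a 1-based start column and uses Perl's repetition operator instead of the two loops; the sample sequence and alignment width are made up for illustration.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start
# (1-based) of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";              # --ACGT----
print length($row), "\n";    # 10
```

Every padded row then has the same length as the contig row, so the rows line up column by column when they are added to the alignment.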

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    # hits seen in the previous iterations
    my @previous_hitarray;

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @$oldhitarray_ref);
        push(@previous_hitarray, @$hitarray_ref);
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                                -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format of its entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # an entry is given as "database:identifier"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;


  • Bioperl course
    • Chapter 1. General introduction
      • 1.1. Bioperl documentation
      • 1.2. General bioperl classes
    • Chapter 2. Sequences
      • 2.1. The Bio::SeqIO class
      • 2.2. Sequence classes
      • 2.3. Features and Location classes
      • 2.4. Sequence analysis tools
    • Chapter 3. Alignments
      • 3.1. AlignIO
      • 3.2. SimpleAlign
      • 3.3. Code reading: protal2dna
    • Chapter 4. Analysis
      • 4.1. Blast
      • 4.2. Genscan
    • Chapter 5. Databases
      • 5.1. Database classes
      • 5.2. Accessing a local database with golden
    • Chapter 6. Perl Reminders
      • 6.1. UML
      • 6.2. Perl reminders to use bioperl modules
      • 6.3. Perl reminders for a further advanced understanding of bioperl modules
    • Appendix A. Solutions
      • A.1. Sequences
      • A.2. Alignments
      • A.3. Analysis
      • A.4. Databases
Page 47: Bio Perl

Chapter 4 Analysis

The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.

Figure 4.3. Blast internal classes diagram

4.2. Genscan

Example 4.8. Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]).

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];
    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    while (my $gene = $genscan->next_prediction()) {
        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";
    }

Example 4.9. Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

    # Genscan example with sub-sequences

    use Bio::Tools::Genscan;

    my $genscan_file = $ARGV[0];

    $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
    while (my $gene = $genscan->next_prediction()) {

        my $prot = $gene->predicted_protein;
        print "protein (", ref($prot), "):\n", $prot->seq, "\n\n";

        # display genscan predicted cds (if -cds genscan option)
        my $predicted_cds = $gene->predicted_cds;
        print "predicted CDS:\n", $predicted_cds->seq, "\n\n";

        foreach my $exon ($gene->exons()) {
            my $loc = $exon->location;
            print "exon - primary_tag: ", $exon->primary_tag,
                  " coding? ", $exon->is_coding(),
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $intron ($gene->introns()) {
            my $loc = $intron->location;
            print "intron primary_tag: ", $intron->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        foreach my $utr ($gene->utrs()) {
            my $loc = $utr->location;
            print "utr primary_tag: ", $utr->primary_tag,
                  " start-end: ", $loc->start, "-", $loc->end, "\n";
        }

        print "---------------------------------\n";
    }

Figure 4.4. Genscan classes structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?


Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                         -write_flag => 1);
    $inx->make_index(@ARGV);
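To make the idea of a flat-file index more concrete, here is a minimal sketch in plain Perl. It is not how the Bio::Index classes are implemented; it only illustrates the principle, assuming FASTA-like records and an index kept in memory as a hash from identifier to byte offset. The names build_index and fetch_record are hypothetical.

```perl
use strict;
use warnings;

# Build a hash mapping each FASTA identifier to the byte offset of its
# header line. $fh must be a seekable filehandle.
sub build_index {
    my ($fh) = @_;
    my %offset;
    seek($fh, 0, 0);
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /^>(\S+)/;
    }
    return \%offset;
}

# Jump to the stored offset and return the whole record as one string.
sub fetch_record {
    my ($fh, $index, $id) = @_;
    return undef unless exists $index->{$id};
    seek($fh, $index->{$id}, 0);
    my $record = <$fh>;
    while (defined(my $line = <$fh>)) {
        last if $line =~ /^>/;
        $record .= $line;
    }
    return $record;
}

# Demonstration on an in-memory "file" (made-up data).
my $data = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $index = build_index($fh);
print fetch_record($fh, $index, 'seq2');
```

Fetching an entry then costs one seek plus the read of a single record, whatever the size of the flat file; a persistent index would store the same offsets on disk.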

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in Embl format:

  • containing an entry for each Genscan predicted CDS

  • put this CDS exons as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28
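The "+/- delta" requirement means that exon coordinates given on the whole chromosome fragment must be re-expressed relative to the extracted subsequence. The sketch below works through that arithmetic in plain Perl, under assumptions that are mine rather than the course's: coordinates are 1-based, exons are given as [start, end] pairs sorted by start, and the window is clamped to the sequence bounds. The function name window_and_shift is hypothetical, and the clamping differs slightly from the arithmetic used in the course solution.

```perl
use strict;
use warnings;

# Given exon [start, end] pairs in genomic coordinates (sorted by start),
# return the window extended by $delta on each side (clamped to the
# sequence) and the exon coordinates relative to that window.
sub window_and_shift {
    my ($exons, $delta, $seq_length) = @_;
    my $win_start = $exons->[0][0] - $delta;
    $win_start = 1 if $win_start < 1;
    my $win_end = $exons->[-1][1] + $delta;
    $win_end = $seq_length if $win_end > $seq_length;
    my $offset  = $win_start - 1;
    my @shifted = map { [ $_->[0] - $offset, $_->[1] - $offset ] } @$exons;
    return ($win_start, $win_end, \@shifted);
}

my ($start, $end, $shifted) =
    window_and_shift([ [250, 321], [434, 508] ], 100, 5000);
print "window: $start..$end\n";
print join(',', map { "$_->[0]..$_->[1]" } @$shifted), "\n";
```

With these made-up numbers the window is 150..608, and the first exon starts at position 101 of the extracted subsequence, that is, just after the 100 bases of upstream context.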

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    print $out $seq;

Solution A.29


Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                       -parse     => 1,
                                       -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

• references on an anonymous array:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => [ $seq ]);    # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = (-method   => 'remote',
                 -prog     => 'blastp',
                 -database => 'swissprot',
                 -seqs     => \@seqs);      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a plain, named hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
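The following self-contained sketch puts the three kinds of references side by side. The hash contents, the describe function and the my_filter callback are made up for the illustration; only ref() and the dereferencing syntax are the point.

```perl
use strict;
use warnings;

my %param = (-prog => 'blastp', -database => 'swissprot');
my @seqs  = ('seq1', 'seq2');
sub my_filter { return length($_[0]) > 3 }

# A function receiving a hash, an array and a function, all by reference.
sub describe {
    my ($hash_ref, $array_ref, $code_ref) = @_;
    my @kept = grep { $code_ref->($_) } @$array_ref;
    return join(' ', ref($hash_ref), ref($array_ref), ref($code_ref),
                $hash_ref->{-prog}, scalar(@kept));
}

# References to named variables...
print describe(\%param, \@seqs, \&my_filter), "\n";
# ...and anonymous structures give the same kinds of references.
print describe({ -prog => 'blastn' }, [ 'a', 'abcd' ], sub { 1 }), "\n";
```

Both calls report HASH, ARRAY and CODE: the called function cannot tell whether a reference points to a named variable or to an anonymous structure.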

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write a string

• In bioperl there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');     # make a tied filehandle
    $sequence = <$in>;                         # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '> f');
    print $out $sequence;                      # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
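Passing a typeglob reference can be tried without bioperl. In this sketch, write_line is a hypothetical helper and the bareword filehandle OUT is opened on an in-memory buffer so that the result can be inspected; an ordinary file would behave the same way.

```perl
use strict;
use warnings;

# Write one line to whatever filehandle reference we are given.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle opened on an in-memory buffer (stands in for a file).
my $buffer = '';
open(OUT, '>', \$buffer) or die "cannot open in-memory file: $!";

write_line(\*OUT, 'first line');        # reference to the typeglob
write_line(\*STDOUT, 'to the screen');  # same mechanism for STDOUT
close(OUT);

print $buffer;
```

Inside write_line the reference behaves like any other filehandle, which is why the -fh argument of newFh accepts \*STDOUT.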

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.


• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
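Both ideas, catching an exception with eval and using an exception to mark a method as abstract, can be reproduced in a few lines of plain Perl. The classes Shape, Square and Blob and the helper try_area are hypothetical; plain die is used here in place of bioperl's throw.

```perl
use strict;
use warnings;

package Shape;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
# "Abstract" method: subclasses are expected to override it.
sub area { my ($self) = @_; die ref($self) . " did not implement area\n" }

package Square;
our @ISA = ('Shape');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;
our @ISA = ('Shape');    # forgets to implement area

package main;

# Run a method call inside eval and report whether it threw.
sub try_area {
    my ($shape) = @_;
    my $area = eval { $shape->area };
    return $@ ? "caught: $@" : "area: $area\n";
}

print try_area(Square->new(side => 3));
print try_area(Blob->new);
```

The call on the Blob object falls through to the inherited area method, which dies; eval traps the error and leaves the message in $@.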


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else        { $format = "Genbank"; }
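To experiment with option parsing without typing command lines, the argument list can be supplied from inside the script. The sketch below wraps getopts in a hypothetical parse_options function that temporarily replaces @ARGV; the option letters and defaults follow the first example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse -i/-o (formats) and -I/-O (file names); each option takes a value.
sub parse_options {
    local @ARGV = @_;
    my %opts;
    getopts('i:o:I:O:', \%opts) or die "bad options\n";
    return {
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
    };
}

my $conf = parse_options('-I', 'seqs.sp', '-o', 'gcg');
print join(' ', map { "$_=$conf->{$_}" } sort keys %$conf), "\n";
```

A colon after an option letter in the getopts string means that the option expects a value; options that were not given fall back to the defaults on the right of ||.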

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI"))

    if ($seq->isa("Bio::Seq::RichSeq"))

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
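These mechanisms can be observed on a toy class hierarchy. The classes Animal and Dog below are hypothetical stand-ins for bioperl classes; @ISA, isa and ref work in the same way on any Perl class.

```perl
use strict;
use warnings;

package Animal;
sub new   { my ($class, %args) = @_; return bless {%args}, $class }
sub speak { my ($self) = @_; return ref($self) . ' makes a sound' }

package Dog;
our @ISA = ('Animal');    # Dog inherits new() and speak() from Animal

package main;

my $dog    = Dog->new(name => 'Rex');
my $class  = ref($dog);                            # class of the object
my $is_a   = $dog->isa('Animal') ? 'yes' : 'no';   # object is-a test
my $cls_is = Dog->isa('Animal')  ? 'yes' : 'no';   # class is-a test

print "class: $class, object isa Animal: $is_a, class isa Animal: $cls_is\n";
print $dog->speak, "\n";
```

Note that isa can be called either on an object or directly on a class name, as in the two tests shown in the list above.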

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles module variables scope (even though there are no private variables in perl).

  • @EXPORT = qw(): symbols exported by default

  • @EXPORT_OK = qw(): symbols imported on demand

  • %EXPORT_TAGS = (tag => []): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
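A tag-based import can be tried in a single file. In the sketch below, My::Tools is a hypothetical module with two made-up functions; because it lives in the same file as the main program, the import that a use statement would normally trigger is called explicitly in a BEGIN block.

```perl
use strict;
use warnings;

package My::Tools;
use Exporter 'import';
our (@EXPORT_OK, %EXPORT_TAGS);
BEGIN {
    @EXPORT_OK   = qw(revcomp gc_count);              # importable on demand
    %EXPORT_TAGS = (seq => [qw(revcomp gc_count)]);   # named set of symbols
}
sub revcomp  { my $s = reverse $_[0]; $s =~ tr/ACGTacgt/TGCAtgca/; return $s }
sub gc_count { return scalar(() = $_[0] =~ /[GCgc]/g) }

package main;
BEGIN { My::Tools->import(':seq') }    # what "use My::Tools qw(:seq)" would do

print revcomp('AACG'), ' ', gc_count('AACG'), "\n";
```

After the import, both functions can be called from the main program without the My::Tools:: prefix.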

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
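The tie mechanism can be observed with a small class of our own. UpperCaseQueue below is a hypothetical class, unrelated to bioperl: printing to the tied handle stores upper-cased strings, and reading from it returns them, which mirrors the PRINT and READLINE roles described above.

```perl
use strict;
use warnings;

package UpperCaseQueue;

# TIEHANDLE is called by tie(); it returns the object behind the handle.
sub TIEHANDLE { my ($class) = @_; return bless { items => [] }, $class }

# PRINT is called by "print $fh ..."; here it stores upper-cased items.
sub PRINT {
    my ($self, @values) = @_;
    push @{ $self->{items} }, map { uc($_) } @values;
    return 1;
}

# READLINE is called by "<$fh>": one item in scalar context,
# all remaining items in list context.
sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{items} } unless wantarray;
    my @all = @{ $self->{items} };
    $self->{items} = [];
    return @all;
}

package main;
use Symbol;

my $fh = Symbol::gensym();
tie *$fh, 'UpperCaseQueue';

print $fh 'acgt', 'ttga';    # calls PRINT
my $first = <$fh>;           # calls READLINE in scalar context
my @rest  = <$fh>;           # calls READLINE in list context
print "$first @rest\n";
```

As with the bioperl sequence handles, the caller only sees an ordinary filehandle; the class behind it decides what reading and printing mean.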

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh(-file   => $infile,
                               -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp
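When no output file is given, the script derives its name from the input file. That step can be checked in isolation; default_outfile is a hypothetical helper that applies the same substitution, assuming the file name contains no directory part with a dot in it.

```perl
use strict;
use warnings;

# Derive an output file name: strip everything from the first dot and
# append the output format as the new extension.
sub default_outfile {
    my ($infile, $outform) = @_;
    (my $outfile = $infile) =~ s/\..*$//;
    return "$outfile.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";   # seqs.gcg
```

With the command line shown above, the converted sequences would therefore be written to seqs.gcg.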

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                               -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh(-file   => $infile,
                               -format => $inform);

    my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                                -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                               -format => $inform);

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default: 'swiss')\n";
        print "  -o outformat -- outfile format (default: 'fasta')\n";
        print "\n";
        print "  -h -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exercise
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split(/\n/, $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end
    # of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index
    # of ending gaps

    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);
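The gap-counting logic can be tested on plain strings, without an alignment object. The sketch below assumes that all rows have the same length and that at least one column survives the trimming; trim_terminal_gaps is a hypothetical helper and the three rows are made-up data.

```perl
use strict;
use warnings;

# Return the alignment rows with the leading and trailing columns removed
# wherever at least one row has a terminal gap.
sub trim_terminal_gaps {
    my ($gap, @rows) = @_;
    my ($lead, $trail) = (0, 0);
    foreach my $row (@rows) {
        my ($head) = $row =~ /^(\Q$gap\E*)/;   # run of gaps at the start
        my ($tail) = $row =~ /(\Q$gap\E*)$/;   # run of gaps at the end
        $lead  = length($head) if length($head) > $lead;
        $trail = length($tail) if length($tail) > $trail;
    }
    my $width = length($rows[0]) - $lead - $trail;
    return map { substr($_, $lead, $width) } @rows;
}

my @trimmed = trim_terminal_gaps('-', '--ACGTAC', 'TTAC-TA-', 'GTACGTAC');
print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the second row, are kept: only the columns before the longest leading gap run and after the longest trailing gap run are removed.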

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10 Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog'     => 'blastp',
    '-data'     => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11 Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12 Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15 Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16 Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank prefix from a hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;    # databanks for which a hit has already been displayed
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
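The "first hit per databank" logic can be checked on plain hit names, without a Blast report. This is a minimal sketch: the function best_hit_by_databank and the hit names are invented for illustration, and the input is assumed to be sorted from best to worst as in a Blast report.

```perl
use strict;
use warnings;

# Extract the databank prefix of a hit name such as "sp|P12345|NAME".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first hit seen for each databank.
sub best_hit_by_databank {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;     # name without a databank prefix
        next if $done{$db}++;        # a better hit was already kept
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, for illustration only
my @hits = ('sp|P0001|AAA_ECOLI', 'sp|P0002|BBB_ECOLI',
            'tr|Q0001|CCC_HUMAN', 'sp|P0003|DDD_YEAST');
print "$_\n" for best_hit_by_databank(@hits);
```

Returning undef when the name does not match avoids reusing a stale $1 from a previous successful match.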

Solution A.17 Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18 Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19 Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20 Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length
);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end
        );
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
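The padding done by fill_gaps can also be written with the repetition operator x instead of two loops. This is a sketch of an alternative with the same arguments, not the course's own code; the clamp for sequences running past the alignment end is an added assumption.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the end: no right padding
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

For a sequence that fits inside the alignment, the padded string is always exactly $length characters long, which is what Bio::SimpleAlign needs when sequences are added to an existing alignment.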

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    # only the first subject of each report is recorded
    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25 Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();    # hits seen in the previous iterations
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift the exon coordinates to the coordinates of the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with their sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # $entry has the form "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;


Chapter 4 Analysis

4.2 Genscan

Example 4.8 Genscan parsing

The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser (data: Genscan output file [data/genscan.out]):

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];
$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

while(my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}

Example 4.9 Genscan parsing with sub-sequences

(data: Genscan output with CDS [data/genscan-cds.out])

# Genscan example with sub-sequences

use Bio::Tools::Genscan;

my $genscan_file = $ARGV[0];

$genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while(my $gene = $genscan->next_prediction()) {

    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";

    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS: \n", $predicted_cds->seq, "\n\n";

    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }

    print "---------------------------------\n";
}

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called at which class hierarchy level?

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2 Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1 Database Classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the DNA genomic sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent, with a named array variable, to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

• references on anonymous data structures:

# (reminder) a plain, named hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = {a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
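The three kinds of references listed above (hash, array used as an output parameter, function) can be combined in one small function. This is a sketch in plain perl; the function collect and the parameter names are made up for the example and are not bioperl calls.

```perl
use strict;
use warnings;

# A function receiving a hash reference, an array reference (used as an
# output parameter) and a code reference (used as a filter).
sub collect {
    my ($params, $results, $filter) = @_;
    foreach my $key (sort keys %$params) {
        push @$results, "$key=$params->{$key}" if $filter->($params->{$key});
    }
    return scalar @$results;
}

my %run_param = (prog => 'blastp', database => 'swissprot', expect => 10);
my @kept;
my $n = collect(\%run_param, \@kept, sub { $_[0] =~ /^[a-z]+$/ });

print "kept $n: @kept\n";
print ref(\%run_param), " ", ref(\@kept), " ", ref(sub {}), "\n";
```

The caller's @kept array is filled by the function because only a reference to it is passed, which is how -save_array works in the example above.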

6.2.2 Filehandles and streams

• A data stream is for instance an open file. Streams are available through filehandles (file descriptors), which in perl are symbols without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;            # read a line
print OUTFILE $sequence;     # write a string

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
$sequence = <$in>;                            # read a sequence object
$out = Bio::SeqIO->newFh ( -file => '> f');
print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
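Passing a filehandle as a parameter can be tried in plain perl. The sketch below uses a made-up function write_record; it receives either a reference to the STDOUT typeglob or a lexical filehandle (here opened on an in-memory string so that no file is needed).

```perl
use strict;
use warnings;

# Write one FASTA-like record to whatever filehandle reference is passed in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A reference to the STDOUT typeglob can be stored and passed around.
my $stdout = \*STDOUT;
write_record($stdout, 'seq1', 'ACGT');

# The same function works with a lexical filehandle, here opened on an
# in-memory string instead of a file.
my $buffer = '';
open(my $mem, '>', \$buffer) || die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);
print $buffer;
```

Lexical filehandles (open my $fh, ...) are ordinary scalars, so they avoid the typeglob reference altogether when the handle is opened by the program itself.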

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self,$string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}
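The eval / $@ pattern can be exercised without bioperl. In this sketch, risky_length is a made-up stand-in for a bioperl call that throws; only die and eval are real perl mechanisms here.

```perl
use strict;
use warnings;

# Stand-in for a bioperl call that throws: it dies when the argument is bad.
sub risky_length {
    my ($seq) = @_;
    die "MSG: empty sequence\n" unless defined $seq && length $seq;
    return length $seq;
}

# Run the call inside eval and report either the value or the error message.
sub try_length {
    my ($seq) = @_;
    my $len = eval { risky_length($seq) };
    if ($@) {
        chomp(my $err = $@);
        return "caught: $err";
    }
    return "ok: $len";
}

print try_length('ACGT'), "\n";
print try_length(''), "\n";
```

The value of the eval block is the value of its last statement, so the result and the error test can be handled right after the block.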

• In bioperl, exceptions are also used to implement interface classes: the following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html],

sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
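The abstract-method idiom can be reproduced with two tiny classes. The class names My::SeqI, My::GoodSeq and My::BadSeq are invented for this sketch, and plain die is used instead of the bioperl throw method.

```perl
use strict;
use warnings;

# Interface class: seq() only complains, so subclasses must override it.
{
    package My::SeqI;
    sub new { my ($class, %args) = @_; return bless { %args }, $class; }
    sub seq {
        my ($self) = @_;
        die ref($self) . " did not provide a seq method\n";
    }
}

# A subclass that implements the method ...
{
    package My::GoodSeq;
    our @ISA = ('My::SeqI');
    sub seq { my ($self) = @_; return $self->{seq}; }
}

# ... and one that forgets to.
{
    package My::BadSeq;
    our @ISA = ('My::SeqI');
}

my $good = My::GoodSeq->new(seq => 'ACGT');
my $bad  = My::BadSeq->new(seq => 'ACGT');

print "good: ", $good->seq, "\n";
my $value = eval { $bad->seq };
print "bad: $@" unless defined $value;
```

The error only shows up when the missing method is called, not when the class is loaded, which is the main limitation of this way of encoding abstract methods.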


6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
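To see what getopts leaves in the hash and in @ARGV, the command line can be simulated inside the script. In this sketch the argument values and the extra -v flag are made up; a letter followed by a colon takes a value, a letter alone is a boolean flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line: -i and -O take a value, -v is a plain flag.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', '-v', 'seqs.embl');

my %opts;
getopts('i:o:I:O:v', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "inform=$inform outform=$outform outfile=$outfile verbose=",
      ($opts{v} ? 1 : 0), "\n";
print "remaining arguments: @ARGV\n";
```

Options that were not given keep their defaults, and getopts removes the options it recognised from @ARGV, leaving only the remaining arguments.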

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

    • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
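The isa, ref and can tests behave the same way on any perl class. The two classes below are invented for this sketch (they only mimic the Bio::Seq / Bio::Seq::RichSeq relationship).

```perl
use strict;
use warnings;

{
    package My::Seq;
    sub new { my ($class, $str) = @_; return bless { seq => $str }, $class; }
    sub seq_length { my ($self) = @_; return length $self->{seq}; }
}
{
    package My::RichSeq;
    our @ISA = ('My::Seq');          # My::RichSeq is-a My::Seq
    sub division { return 'PLN' }
}

my $seq = My::RichSeq->new('ACGT');

print "class of seq: ", ref($seq), "\n";
print "is a My::Seq\n"              if $seq->isa('My::Seq');
print "class-level test ok\n"       if My::RichSeq->isa('My::Seq');
print "has no 'translate' method\n" unless $seq->can('translate');
print "length: ", $seq->seq_length, "\n";
```

ref gives the exact class of the object, whereas isa also answers true for every parent class, so isa is usually the safer test.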

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
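The moment at which a BEGIN block runs can be observed by recording the order of execution. This is a small sketch with a made-up package My::Policy; it does not use the bioperl coordinate policy classes.

```perl
use strict;
use warnings;

our @order;

push @order, 'main code';

{
    package My::Policy;
    our $default;
    # Runs while the file is being compiled, before any ordinary statement.
    BEGIN {
        $default = 'widest';
        push @order, 'BEGIN block';
    }
    sub policy { return $default }
}

print "order: @order\n";
print "default policy: ", My::Policy::policy(), "\n";
```

Although the BEGIN block comes later in the file, it is recorded first: the default is therefore already set when any method of the package is called.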

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods
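The three Exporter variables can be tried with a module defined inline. In this sketch My::Tools, revcom, gc_count and $Default are all invented names; a real module would live in its own My/Tools.pm file, and the %INC line only exists so that use finds the inline definition.

```perl
use strict;
use warnings;

# A module defined inline (normally it would live in My/Tools.pm).
BEGIN {
    package My::Tools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcom');                 # exported by default
    our @EXPORT_OK   = ('gc_count', '$Default');   # imported on demand
    our %EXPORT_TAGS = (obj => ['$Default']);      # tag for a set of symbols
    our $Default     = 'dna';

    sub revcom   { my $s = reverse $_[0]; $s =~ tr/ACGTacgt/TGCAtgca/; return $s; }
    sub gc_count { return scalar(() = $_[0] =~ /[GCgc]/g); }
    $INC{'My/Tools.pm'} = __FILE__;   # tell perl the module is already loaded
}

# Default symbols, plus one on-demand function, plus everything tagged :obj.
use My::Tools qw(:DEFAULT gc_count :obj);

print revcom('AACG'), "\n";
print gc_count('AACG'), "\n";
print "$Default\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, which is why $Default is listed twice.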

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;         # make a tied filehandle
$sequence = <$fh>;      # read a sequence object (calls READLINE)
print $fh $sequence;    # write a sequence object (calls PRINT)
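The same mechanism can be sketched without bioperl. Local::QueueFh is a hypothetical class that serves items from an in-memory list instead of sequence objects; the method names TIEHANDLE, READLINE and PRINT are the ones Perl itself looks for.

```perl
use strict;
use warnings;
use Symbol ();

package Local::QueueFh;                 # hypothetical class, for illustration only

sub new {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}

# Build an anonymous glob and tie it to this class.
sub fh {
    my $self  = shift;
    my $class = ref($self) || $self;
    my $s     = Symbol::gensym();
    tie *$s, $class, $self;
    return $s;
}

sub TIEHANDLE { my ($class, $obj) = @_; return bless { obj => $obj }, $class; }

# Called for <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self  = shift;
    my $items = $self->{obj}{items};
    return shift @$items unless wantarray;
    return splice @$items;
}

# Called for print $fh ...
sub PRINT {
    my $self = shift;
    push @{ $self->{obj}{written} }, @_;
    return 1;
}

package main;

my $queue = Local::QueueFh->new('seq1', 'seq2', 'seq3');
my $fh    = $queue->fh;

my $first = <$fh>;          # calls READLINE in scalar context
print $fh $first;           # calls PRINT
my @rest  = <$fh>;          # calls READLINE in list context

print "first: $first, rest: @rest, written: @{ $queue->{written} }\n";
```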

Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert: via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert: via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert: with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta
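The option handling alone can be tried without bioperl. This is a minimal sketch with Getopt::Std only; the helper parse_args and the keys of the hash it returns are made up for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a list of arguments the way the converter scripts do.
# A letter followed by ':' in the specification takes a value.
sub parse_args {
    local @ARGV = @_;                   # getopts() always works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
        rest    => [@ARGV],             # what is left once the options are removed
    };
}

my $conf = parse_args('-o', 'GCG', 'seqs.sp');
print "in: $conf->{inform}, out: $conf->{outform}, files: @{ $conf->{rest} }\n";
```

Applying the default before calling lc() avoids a warning when an option is absent.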

UnivConvert: with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;    # false
    }
    return 1;        # true
}

sub usage {
    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat   -- infile format (default 'swiss')\n";
    print "  -o outformat  -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h            -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
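The gap-counting step can be checked on plain strings. This is a minimal sketch in core Perl; the helper trim_gapped_ends is hypothetical and the rows stand in for the aligned sequences.

```perl
use strict;
use warnings;

# Remove the leading and trailing columns in which at least one row starts
# or ends with a gap. All rows are assumed to have the same length.
sub trim_gapped_ends {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length $lead  if length $lead  > $startgaps;
        $endgaps   = length $trail if length $trail > $endgaps;
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    $width = 0 if $width < 0;
    return map { substr($_, $startgaps, $width) } @rows;
}

my @trimmed = trim_gapped_ends('-', '--ACGT-A', 'TTAC-TGA', '-AACGT--');
print "$_\n" for @trimmed;
```

Internal gaps are kept: only the outer block, where some sequence has not yet started or has already ended, is removed.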

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# the databank is the first field of a hit name such as db|accession|name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
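The "first hit per databank" logic relies only on a hash of already seen databank names. Here is a minimal sketch in core Perl; the hit names are invented and are assumed to arrive best hit first, as in a Blast report.

```perl
use strict;
use warnings;

# The databank is the first field of a hit name such as "sp|P12345|NAME".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Invented hit names, in report order
my @hit_names = ('sp|P12345|TAUD_ECOLI', 'gb|AAC73471|', 'sp|Q00001|X_Y', 'pir|A01|');

my (%done, @best);
for my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db}++;       # a better hit from this databank was already kept
    push @best, $name;
}

print "$_\n" for @best;
```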

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
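The padding done by fill_gaps can be checked on its own. The sketch below is an equivalent formulation with the repetition operator x instead of the two loops; it is offered as an illustration, not as the course's code.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# in a row of $length columns.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    $left  = 0 if $left < 0;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";
print fill_gaps('ACGT', 1, 4), "\n";
```

A fragment of four residues placed at column 3 of a ten-column row gets two gaps on the left and four on the right.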

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
            last;
        }
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag: ", $feat->primary_tag,
                 " produced by: ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray;
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#   + 1 feature for the whole CDS
#   + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;

Page 49: Bio Perl

Chapter 4 Analysis

print "intron: primary_tag: ", $intron->primary_tag, " start-end: ", $loc->start, " - ", $loc->end, "\n";

foreach my $utr ($gene->utrs()) {
    my $loc = $utr->location;
    print "utr: primary_tag: ", $utr->primary_tag, " start-end: ", $loc->start, " - ", $loc->end, "\n";
}

print "---------------------------------\n";

Figure 4.4. Genscan Classes Structure

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5 Databases

5.1 Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                      -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in Embl format

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this Embl formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta' );

print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                    -parse     => 1,
                                    -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

• references on an anonymous array:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

%runParam = ( -method   => 'remote',
              -prog     => 'blastp',
              -database => 'swissprot',
              -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a hash
%a = (a => 1, b => 2 );
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = {a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
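The different uses of references listed above can be combined in one small core Perl sketch. The function run_filter and the parameter names are hypothetical; they only imitate the style of bioperl named parameters.

```perl
use strict;
use warnings;

# Takes a hash reference holding an array reference (input), a code
# reference (filter) and another array reference (output parameter).
sub run_filter {
    my ($params) = @_;
    my $filter = $params->{-filt_func};
    for my $seq (@{ $params->{-seqs} }) {
        push @{ $params->{-save_array} }, $seq if $filter->($seq);
    }
    return scalar @{ $params->{-save_array} };
}

my @kept;                                            # filled by the function
my %run_param = ( -seqs       => [ 'ACGT', 'AC', 'ACGTAC' ],   # anonymous array
                  -filt_func  => sub { length($_[0]) >= 4 },   # anonymous function
                  -save_array => \@kept );                     # output parameter

my $count = run_filter(\%run_param);

print "kept $count sequences: @kept\n";
print join(' ', ref(\%run_param), ref($run_param{-seqs}), ref($run_param{-filt_func})), "\n";
```

Since the function receives references, it modifies the caller's @kept array directly; nothing is copied.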

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;           # read a line
    print OUTFILE $sequence;    # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
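To illustrate the typeglob rule with core Perl only, here is a small sketch. The write_record helper and the record strings are made up for the example; the second half uses a lexical filehandle opened on an in-memory string, an alternative the course text does not cover.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one record to whatever filehandle it is given.
sub write_record {
    my ($fh, $record) = @_;
    print $fh "$record\n";
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_record(\*STDOUT, ">seq1 passed through a typeglob reference");

# A lexical filehandle is an ordinary scalar and can be passed directly.
# Here it is opened on an in-memory string instead of a file.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";
write_record($out, ">seq2 passed as a lexical handle");
close($out);

print "buffer holds: $buffer";
```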

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
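The abstract-method idiom and its eval counterpart can be sketched in a few lines of core Perl. The classes My::SeqI, My::Seq and My::BrokenSeq are hypothetical stand-ins, and a plain die plays the role of bioperl's throw; this is an illustration of the mechanism, not the bioperl implementation.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() must be provided by subclasses.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . " did not implement seq\n";   # plays the role of throw
}

# A subclass that implements seq()
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

# A subclass that forgets to implement it
package My::BrokenSeq;
our @ISA = ('My::SeqI');

package main;

my @messages;
foreach my $obj (My::Seq->new(seq => 'ACGT'), My::BrokenSeq->new()) {
    my $value = eval { $obj->seq() };    # undef if the block died
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "ok: $value\n";
    }
}
print @messages;
```

The second object inherits the interface version of seq, so calling it raises the exception, which the eval block turns into an ordinary value of $@.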

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');     # -f takes a value
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
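A runnable variant of the first example is sketched below. The command line is simulated by assigning to @ARGV, and the file names are invented for the illustration; the option letters follow the example above, with an extra -h flag.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, so the sketch runs without arguments.
@ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra.txt');

my %opts = ();
# A letter followed by ':' takes a value; a bare letter is a boolean flag.
getopts('hi:o:I:O:', \%opts) or die "bad options\n";

my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';
my $outfile = $opts{'O'} || 'outfile';

print "informat=$inform outformat=$outform outfile=$outfile\n";
print "remaining arguments: @ARGV\n";
```

getopts removes the options it recognizes from @ARGV, so whatever is left can be treated as file names.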

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

    if ($seq->isa("Bio::Seq::RichSeq")) ...

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
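The three operations above (inheritance through @ISA, the isa test and ref) can be tried on a pair of toy classes. My::Feature and My::Exon are hypothetical names used only for this sketch.

```perl
use strict;
use warnings;

package My::Feature;
sub new {
    my ($class, %args) = @_;
    return bless { start => $args{-start}, end => $args{-end} }, $class;
}
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1; }

package My::Exon;
our @ISA = qw(My::Feature);      # My::Exon inherits new() and span()

package main;

my $exon = My::Exon->new(-start => 10, -end => 42);

print "class of exon: ", ref($exon), "\n";
print "exon is a My::Feature\n" if $exon->isa('My::Feature');
print "My::Exon is a My::Feature\n" if My::Exon->isa('My::Feature');
print "span: ", $exon->span(), "\n";
```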

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
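The timing of a BEGIN block can be observed with a short sketch. My::Location and the string 'widest' are placeholders for illustration; they only mimic the shape of the Bio::LocationI example.

```perl
use strict;
use warnings;

package My::Location;
our $coord_policy;

# Runs as soon as the block has been compiled, before any run-time statement.
BEGIN { $coord_policy = 'widest'; }

sub policy { return $coord_policy; }

package main;
our @order;

push @order, 'run time';
BEGIN { push @order, 'compile time'; }   # written second, executed first

print "default policy: ", My::Location::policy(), "\n";
print "execution order: ", join(', ', @order), "\n";
```

Although the BEGIN block in main appears after the push statement, its entry ends up first in @order, because the block runs during compilation.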

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl)

• @EXPORT = qw(): symbols exported by default

• @EXPORT_OK = qw(): symbols imported on demand

• %EXPORT_TAGS = (tag => []): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

  imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
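The three Exporter variables can be exercised in a single file. In this sketch My::SeqTools, revcomp, gc_count and $Default_Gap are hypothetical names; the BEGIN blocks are only needed because the module and its user share one file, and in separate files a plain use statement would do the same job.

```perl
use strict;
use warnings;

package My::SeqTools;
use Exporter ();
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Default_Gap);

BEGIN {
    @ISA         = qw(Exporter);
    @EXPORT      = qw(revcomp);                  # exported by default
    @EXPORT_OK   = qw(gc_count $Default_Gap);    # exported on demand
    %EXPORT_TAGS = ( obj => [ qw($Default_Gap) ] );
}

$Default_Gap = '-';

sub revcomp {
    my ($s) = @_;
    $s = reverse $s;
    $s =~ tr/ACGTacgt/TGCAtgca/;
    return $s;
}
sub gc_count { my ($s) = @_; return ($s =~ tr/GCgc//); }

package main;

# In a separate file: use My::SeqTools qw(:DEFAULT gc_count :obj);
BEGIN { My::SeqTools->import(qw(:DEFAULT gc_count :obj)); }

print "revcomp:  ", revcomp('AACG'), "\n";
print "gc_count: ", gc_count('AACG'), "\n";
print "gap char: $Default_Gap\n";
```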

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables
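A brief sketch of both pragmas follows. The hash contents and variable names are invented for the example; the eval block is there only so that the strict violation can be reported instead of stopping the script.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);   # pre-declared package globals

%valid_type = ( dna => 1, rna => 1, protein => 1 );
my $type = 'dna';                # lexical variable declared with my

print "$type is a valid type\n" if $valid_type{$type};

# Under strict refs, using a string as a reference is a run-time error.
my $name = 'valid_type';
my $ok = eval { my %copy = %$name; 1 };
print "symbolic reference rejected: $@" unless $ok;
```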

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
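The same mechanism can be reproduced with a toy class. My::RecordStream below is a hypothetical stand-in for a SeqIO object: it hands out array elements on reading and stores whatever is printed to it. It is a sketch of the tie protocol, not of the bioperl code.

```perl
use strict;
use warnings;

# Hypothetical record stream: each read returns one stored record,
# each print stores its arguments.
package My::RecordStream;

sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;   # scalar context: one record
    my @all = @{ $self->{in} };                        # list context: all the rest
    $self->{in} = [];
    return @all;
}
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;
use Symbol;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'My::RecordStream', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;          # calls READLINE in scalar context
my @rest  = <$fh>;          # calls READLINE in list context
print $fh $first, @rest;    # calls PRINT

print "first: $first, rest: @rest\n";
print "stored: @{ $obj->{out} }\n";
```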

Appendix A. Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() so that -w does not warn on undef
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
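The gap-counting logic of this solution does not depend on bioperl and can be checked on plain strings. The rows below are made-up data, and substr stands in for the slice method; this is a sketch of the idea only.

```perl
use strict;
use warnings;

# Toy alignment rows (made-up data), all of the same length.
my @rows = ( '--ACGTAC-', '-AACGTACG', '---CGTAC-' );

my ($startgaps, $endgaps) = (0, 0);
foreach my $str (@rows) {
    my ($lead)  = $str =~ /^(-*)/;    # run of gaps at the start
    my ($trail) = $str =~ /(-*)$/;    # run of gaps at the end
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no row starts or ends with a gap.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "leading gaps: $startgaps, trailing gaps: $endgaps\n";
print "$_\n" for @trimmed;
```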

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                        '-prog'     => 'blastp',
                        '-data'     => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit accession: ", $hit->accession(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while ( my $hit = $result->next_hit() ) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while ( my $hsp = $hit->next_hsp() ) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
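The "first hit per databank" selection can be tested without a Blast report. In the sketch below the hit names are invented placeholders in db|accession|entry form, assumed to be sorted best hit first, and a %done hash plays the same role as in the solution.

```perl
use strict;
use warnings;

# Made-up hit names in "db|accession|entry" form, best hit first.
my @hit_names = ('sp|P37610|TAUD_ECOLI', 'sp|Q12358|JLP1_YEAST',
                 'pir|A12345|XYZ', 'sp|P00001|AAA', 'pir|B54321|ABC');

# Extract the databank prefix (the word before the first '|').
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

my %done;
my @best;
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db}++;     # a better hit from this databank was already seen
    push @best, $name;
}

print "$_\n" for @best;
```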

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                        'program'   => 'blastp',
                        'database'  => 'swissprot',
                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while ( my $result = $blast_report->next_result() ) {
        while ( my $hit = $result->next_hit() ) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                        -seq   => $seq->seq,
                        -id    => $seq->display_id,
                        -start => 1,
                        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        my $i = 0;
        while ( my $hsp = $hit->next_hsp() ) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
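The padding performed by fill_gaps can be checked on its own. The version below is a compact sketch that uses the repetition operator instead of loops, with guards against negative counts added as an assumption; the input values are arbitrary.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# and the result is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $before = $start - 1;
    $before = 0 if $before < 0;
    my $after = $length - $before - length($seq);
    $after = 0 if $after < 0;
    return ('-' x $before) . $seq . ('-' x $after);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
print "width: ", length($padded), "\n";
```

Each HSP padded this way has the same width as the contig, which is what allows it to be added as a row of the alignment.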

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while ( my $sbjct = $report->nextSbjct ) {
            print STDERR "name: ", $sbjct->name, "\n";
            while ( my $hsp = $sbjct->nextHSP ) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while ( my $subject = $report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while ( my $hsp = $report->next_feature ) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                            -seq   => $hsp->qs,
                            -id    => $query1->display_id,
                            -start => 1,
                            -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
                            -seq   => $hsp->ss,
                            -id    => $report->sbjctName,
                            -start => 1,
                            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;                 # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

  You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

  You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

  You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

  since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while ( my $gene = $genscan->next_prediction() ) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds ) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module must return a true value


Page 50: Bio Perl

Chapter 4. Analysis

Figure 4.4 Genscan Classes Structure

Exercise 4.22 Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which class hierarchy level?

44

Chapter 5 Databases

5.1 Database classes

Example 5.1 Database class use

    use Bio::Index::Swissprot;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh( -fh => \*STDOUT, -format => 'fasta' );
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name );

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        print $out $seq;
    }

Example 5.2 Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

    use Bio::Index::Swissprot;

    my $Index_File_Name = shift;
    my $inx = Bio::Index::Swissprot->new( -filename   => $Index_File_Name,
                                          -write_flag => 1 );
    $inx->make_index(@ARGV);

Figure 5.1 Database classes structure

Exercise 5.1 Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]):

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2 Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3 Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for access to local databases. We could use bioperl indexes, but this is not efficient for large databases (indexing time and index size). All these databases are already indexed, very efficiently, by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden]. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id( $ARGV[0] );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6 Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1 UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

which is equivalent to the following form, with a variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

• references on anonymous data structures:

    # (reminder) hash
    %a = ( a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b}\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b}\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1]\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
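The reference forms listed above can be tried without bioperl. The following is a minimal sketch in plain Perl; the helper describe and all variable names are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash reference, an array reference
# and a code reference, as bioperl constructors often do.
sub describe {
    my ($params, $list, $filter) = @_;
    my @kept = grep { $filter->($_) } @$list;
    return join(',', map { "$_=$params->{$_}" } sort keys %$params)
         . ' | ' . join(',', @kept);
}

my %run_param = ( -prog => 'blastp', -database => 'swissprot' );
my @scores    = (12, 85, 40);
sub high_score { return $_[0] >= 40 }

# \%hash, \@array and \&function build references to named variables
my $summary = describe(\%run_param, \@scores, \&high_score);
print "$summary\n";

# { ... } and [ ... ] build references to anonymous structures
my $ra = { a => 1, b => 2 };
my $rl = [ 1, 2 ];
print ref($ra), " $ra->{a} $ra->{b}\n";
print ref($rl), " $rl->[0] $rl->[1]\n";
```

The called function dereferences with %$params, @$list and $filter->(...), which is what a bioperl method does internally with parameters such as -run or -filt_func.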

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a line
    print OUTFILE $sequence;      # write to the output stream

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
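A short sketch of the two ways of handing a filehandle to a function, in plain Perl. The helper write_line is hypothetical, and the in-memory file (a reference to a scalar given to open) is used here only to keep the example free of files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it receives.
sub write_line {
    my ($fh, $text) = @_;
    print $fh "$text\n";
}

# A bareword filehandle is passed as a reference to its typeglob
write_line(\*STDOUT, "to standard output");

# A lexical filehandle is an ordinary scalar and can be passed directly.
# Here it is opened on an in-memory string instead of a file on disk.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($out, "first");
write_line($out, "second");
close($out);

open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
my @lines = <$in>;       # list context: read all lines
close($in);
chomp(@lines);
print scalar(@lines), " lines read: @lines\n";
```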

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses the perl exception mechanism (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, from Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
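The die/eval mechanism and the "abstract method" idiom can be reproduced in a few lines of plain Perl. This is only a sketch: the classes AbstractSeq and SimpleSeq and the helper try_seq are hypothetical and are not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical base class playing the role of an interface:
# its seq method only raises an exception.
package AbstractSeq;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub seq { die "AbstractSeq: implementing class did not provide seq\n" }

# Hypothetical subclass that provides the method.
package SimpleSeq;
our @ISA = ('AbstractSeq');
sub seq { my $self = shift; return $self->{seq} }

package main;

# Run a method call inside eval and report what happened.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq };   # the statement ends with ';' after the block
    return $@ ? "caught: $@" : "ok: $value";
}

my $ok  = try_seq(SimpleSeq->new(seq => 'ACGT'));
my $bad = try_seq(AbstractSeq->new);
print "$ok\n";
print $bad;
```

Calling seq on the base class dies, and the message is then available in $@ after the eval block, exactly as with a bioperl throw.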

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
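To see what getopts does to @ARGV, the command line can be simulated inside the script. A minimal sketch; the option letters follow the first example above, and the argument values are made up.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line; in a real script @ARGV is filled by the shell.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.txt');

my %opts;
# A letter followed by ':' takes a value; a bare letter is a boolean flag.
getopts('hi:o:I:O:', \%opts)
    or die "usage: $0 [-h] [-i fmt] [-o fmt] [-I file] [-O file]\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

# Recognized options are removed from @ARGV; other arguments stay.
print "in=$inform out=$outform file=$outfile rest=@ARGV\n";
```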

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
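The same three operations (inheritance through @ISA, the isa test, ref to find the class) in a self-contained sketch. PlainSeq and AnnotatedSeq are hypothetical classes that only mimic the shape of a bioperl hierarchy.

```perl
use strict;
use warnings;

# Hypothetical two-level hierarchy.
package PlainSeq;
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{-id}, seq => $args{-seq} };
    return bless $self, $class;
}
sub seq_length { my $self = shift; return length($self->{seq}) }

package AnnotatedSeq;
our @ISA = ('PlainSeq');             # inheritance is declared through @ISA
sub division { return 'PLN' }

package main;

my $seq = AnnotatedSeq->new(-id => 'demo', -seq => 'ACGTAC');

print "class: ", ref($seq), "\n";                         # class of the object
print "is a PlainSeq\n"       if $seq->isa('PlainSeq');   # test on an object
print "class-level test ok\n" if AnnotatedSeq->isa('PlainSeq');
print "length: ", $seq->seq_length, "\n";                 # inherited method
```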

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
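The effect of a BEGIN block on execution order can be observed directly. In this sketch (variable names are hypothetical) the BEGIN block is written after an ordinary statement, yet it runs first, because it is executed as soon as it has been compiled.

```perl
use strict;
use warnings;

our @order;            # package variables, declared before the BEGIN block
our $default_policy;

push @order, 'runtime statement';

# Executed during compilation, before any ordinary statement of the file.
BEGIN {
    $default_policy = 'widest';      # hypothetical default value
    push @order, 'BEGIN block';
}

print "policy=$default_policy\n";
print join(' -> ', @order), "\n";
```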

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

    • @EXPORT = qw(...): symbols exported by default

    • @EXPORT_OK = qw(...): symbols imported on demand

    • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

      imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods
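A sketch of the three Exporter variables at work. The module SeqUtil and its functions are hypothetical; it is defined inline, inside a BEGIN block, only so that the example fits in one file. In a real project it would live in SeqUtil.pm and be loaded with use.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline.
BEGIN {
    package SeqUtil;
    use Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcomp');                # exported by default
    our @EXPORT_OK   = ('gc_count');               # imported on demand
    our %EXPORT_TAGS = ( stats => ['gc_count'] );  # named set of symbols

    sub revcomp {
        my $s = scalar reverse $_[0];
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }
    sub gc_count { my $s = shift; return ($s =~ tr/GCgc//) }
}

# Equivalent of: use SeqUtil qw(revcomp :stats);
BEGIN { SeqUtil->import('revcomp', ':stats') }

print revcomp('AACG'), " ", gc_count('AACG'), "\n";
```

Without the ':stats' tag in the import list, gc_count would not be visible in the calling package and would have to be called as SeqUtil::gc_count.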

6.3.2 Compiler instructions

• use strict: no symbolic references, no undeclared variables (local variables are declared with a my statement; global variables must be pre-declared or fully qualified), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger calls to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
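The mechanism can be reproduced outside bioperl with a small class of its own. In this sketch, RecordStream is a hypothetical class: READLINE hands out records from a list and PRINT stores what is written, which mirrors the fh/READLINE/PRINT trio shown above without claiming to be the bioperl implementation.

```perl
use strict;
use warnings;
use Symbol ();

package RecordStream;

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}

sub fh {
    my $self = shift;
    my $sym  = Symbol::gensym();
    tie *$sym, ref($self), $self;    # calls TIEHANDLE
    return $sym;
}

# The tied object is the stream object itself.
sub TIEHANDLE { my ($class, $obj) = @_; return $obj }

sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    my @all = @{ $self->{in} };
    @{ $self->{in} } = ();
    return @all;
}

sub PRINT { my $self = shift; push @{ $self->{out} }, @_; return 1 }

package main;

my $stream = RecordStream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;

my $first = <$fh>;            # scalar context: one record (READLINE)
my @rest  = <$fh>;            # list context: all remaining records
print $fh $first, @rest;      # goes through PRINT
print "stored: @{ $stream->{out} }\n";
```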

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 11: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile ) {
        # derive the output name from the input name: replace the extension
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 11: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 11: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 11: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "   -i informat  -- infile format (default 'swiss')\n";
        print "   -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "   -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
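The column arithmetic of this solution can be checked on plain strings, without Bio::AlignIO. A minimal sketch: the function ungapped_window and the three aligned rows are made up for the illustration.

```perl
use strict;
use warnings;

# Given aligned strings of equal length, return the 1-based first and
# last columns that remain once the longest leading and trailing gap
# runs are cut off (same idea as $startgaps / $endgaps above).
sub ungapped_window {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC--',
            'TTACGTACG-',
            '-TACGTAC--');
my ($from, $to) = ungapped_window('-', @rows);
my @trimmed = map { substr($_, $from - 1, $to - $from + 1) } @rows;
print "columns $from..$to\n";
print "$_\n" for @trimmed;
```

The pair ($from, $to) corresponds to the two arguments passed to slice in the solution.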

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (alternative to a); use these lines instead of the two above)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1.prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
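The "first hit seen per databank" logic of this solution can be tried on a plain list of names. A sketch only: the hit names below are made up, and it is assumed, as in the solution, that hits come in report order with the best one first.

```perl
use strict;
use warnings;

# Made-up hit names in report order, following the "db|accession|entry" pattern.
my @hit_names = (
    'sp|X00001|FAKE1',
    'sp|X00003|FAKE3',
    'tr|X00002|FAKE2',
    'pdb|1ABC|A',
    'tr|X00004|FAKE4',
);

# Extract the databank prefix of a hit name.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

my (%done, @best);
for my $name (@hit_names) {
    my $db = dbname($name);
    next if $done{$db}++;      # a hit of this databank was already kept
    push @best, $name;
}
print "$_\n" for @best;
```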

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
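The padding done by fill_gaps can be checked in isolation. The following sketch is a variant written with the repetition operator x instead of the two loops; it is meant to produce the same string for the same arguments, and the sample sequence is made up.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start
# (1-based) of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;          # sequence runs past the alignment end
    return ('-' x $lead) . $seq . ('-' x $trail);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
print length($row), "\n";
```

With a start of 3 and a width of 10, the four residues occupy columns 3 to 6, so two gaps are added in front and four behind.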

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::Seq::RichSeq;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                            -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        # the first exon fixes the start of the entry subsequence
        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift exon coordinates so they are relative to the subsequence
        $loc->start($loc->start - $start);
        $loc->end($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                   -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value('translation',
                               $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ($start > $end) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                      -id      => $ids[1],
                                      -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # database name => sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # entries are given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                             -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;

Page 51: Bio Perl

Exercise 4.22. Code reading: Bio::Tools::Genscan module

1. What is the purpose of the Bio::SeqAnalysisParserI class?

2. Look at the _parse_predictions method in Bio::Tools::Genscan.

3. What happens during the creation of a Bio::Tools::Genscan instance: which methods are called, and at which level of the class hierarchy?

Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                     -write_flag => 1);
$inx->make_index(@ARGV);

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

    • containing an entry for each Genscan predicted CDS

    • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS   join(complement(100..171),complement(284..358),
FT         complement(457..492),complement(626..673),
FT         complement(1460..1511),complement(2473..2513),
FT         complement(2638..2716),complement(2813..2969),
FT         complement(3046..3178),complement(3263..3390),
FT         complement(3476..3635))
FT         /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT         TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT         DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT         NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT         DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT         STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28
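The "+/- delta" window asked for in Exercise 5.2 can be tried out on a plain string before involving any bioperl object. The sketch below is only an illustration in core Perl: the helper name window and its 1-based, inclusive coordinates are assumptions made for this example, not part of bioperl or of the course solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not a bioperl method): returns the 1-based,
# inclusive window [start - delta, end + delta], clamped to the
# sequence bounds, together with the corresponding substring.
sub window {
    my ($seq, $start, $end, $delta) = @_;
    my $from = $start - $delta;
    my $to   = $end + $delta;
    $from = 1            if $from < 1;
    $to   = length($seq) if $to > length($seq);
    return ($from, $to, substr($seq, $from - 1, $to - $from + 1));
}

# first 20 bases of the example entry shown in the exercise
my $genome = 'ttaagttccattgatgtatt';

my ($from, $to, $sub) = window($genome, 3, 6, 2);
print "window $from..$to: $sub\n";

my ($from2, $to2, $sub2) = window($genome, 15, 19, 5);
print "window $from2..$to2: $sub2\n";
```

Clamping both ends keeps the window valid when a predicted gene lies close to either end of the submitted sequence.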

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you use the modules [use] or gain a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
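The three uses of references listed above (hash parameter, function parameter, output parameter) can be combined in one short core-Perl sketch. The names run_filter, high_score and the scores are invented for this illustration and have no counterpart in bioperl.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash reference (input), a code
# reference (filter) and an array reference used as output parameter.
sub run_filter {
    my ($params, $filter, $results) = @_;
    foreach my $name (sort keys %$params) {
        push @$results, $name if $filter->($params->{$name});
    }
    return scalar @$results;
}

my %scores = (hit_a => 12, hit_b => 250, hit_c => 97);    # made-up data
my @kept;                                                 # filled by run_filter

sub high_score { return $_[0] >= 90 }

my $count = run_filter(\%scores, \&high_score, \@kept);
print "kept $count: @kept\n";

# Anonymous structures are created directly as references.
my $rh = { a => 1, b => 2 };
my $rl = [ 1, 2 ];
print ref($rh), " $rh->{a} $rh->{b}\n";
print ref($rl), " $rl->[0] $rl->[1]\n";
```

Passing \@kept rather than @kept is what lets the function modify the caller's array, which is the role -save_array plays in the Blast parameters above.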

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;          # read a line
    print OUTFILE $line;       # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh( -file => '> f' );
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
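As a small core-Perl illustration of the typeglob reference described above, the sketch below passes \*STDOUT and a lexical filehandle to the same subroutine. The helper write_record and the in-memory output file are assumptions chosen for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one FASTA-like record to any filehandle.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword filehandle has to be passed as a typeglob reference.
write_record(\*STDOUT, 'seq1', 'ACGT');

# A lexical filehandle (here an in-memory file) is an ordinary scalar
# and can be passed directly.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($out, 'seq2', 'TTGA');
close($out);
print $buffer;
```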

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

  bioperl indeed uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
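The abstract-method pattern and the eval/$@ idiom can be reproduced without bioperl. The following is a minimal sketch in core Perl; the classes SeqInterface, GoodSeq and BadSeq and the helper try_seq are hypothetical names used only for this illustration, and a plain die stands in for bioperl's throw.

```perl
use strict;
use warnings;

package SeqInterface;      # hypothetical interface class
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub seq {
    # "abstract" method: dies unless a subclass overrides it
    my ($self) = @_;
    die ref($self) . ": seq() not implemented\n";
}

package GoodSeq;           # implements seq()
our @ISA = ('SeqInterface');
sub seq { return $_[0]->{seq} }

package BadSeq;            # forgets to implement seq()
our @ISA = ('SeqInterface');

package main;

# Calls seq() inside eval and reports whether an exception was raised.
sub try_seq {
    my ($obj) = @_;
    my $value = eval { $obj->seq() };
    return $@ ? "caught: $@" : "ok: $value\n";
}

print try_seq(GoodSeq->new(seq => 'ACGT'));
print try_seq(BadSeq->new());
```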

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
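To experiment with option parsing without typing command lines, the call to getopts can be wrapped in a subroutine that works on a local copy of @ARGV. This is a sketch only: parse_options and the returned keys are names chosen for this example, and the option letters mirror those used in the converter scripts of this course.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parses the given argument list instead of the
# real command line, and applies the same defaults as the examples.
sub parse_options {
    local @ARGV = @_;
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
        rest    => [@ARGV],      # what getopts left unparsed
    };
}

my $conf = parse_options('-o', 'GCG', 'seqs.sp');
print "in=$conf->{inform} out=$conf->{outform} files=@{$conf->{rest}}\n";
```

A letter followed by a colon takes a value; a letter without one is a boolean flag, which getopts sets to 1 in the hash.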

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

    • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

    • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
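The inheritance, isa and ref mechanisms listed above can be checked on a toy hierarchy in core Perl. The classes Feature and Exon below are hypothetical stand-ins written for this sketch; they are not the bioperl classes of similar names.

```perl
use strict;
use warnings;

package Feature;                       # hypothetical base class
sub new {
    my ($class, %args) = @_;
    return bless { start => $args{start}, end => $args{end} }, $class;
}
sub span {
    my $self = shift;
    return $self->{end} - $self->{start} + 1;
}

package Exon;                          # hypothetical subclass
our @ISA = ('Feature');                # inherits new() and span()

package main;

my $exon = Exon->new(start => 100, end => 171);

print "class of exon: ", ref($exon), "\n";
print "the object is a Feature\n" if $exon->isa('Feature');
print "the class is a Feature\n"  if Exon->isa('Feature');
print "span: ", $exon->span(), "\n";
```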

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
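The execution order can be observed directly: a BEGIN block runs as soon as the compiler has parsed it, before any ordinary statement of the file. The variable name @order is just an assumption of this sketch.

```perl
use strict;
use warnings;

our @order;     # package variable, visible at compile time and run time

push @order, 'first run-time statement';

# Runs during compilation, although it appears second in the file.
BEGIN { push @order, 'BEGIN block' }

push @order, 'second run-time statement';

print join(' -> ', @order), "\n";
```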

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

    use Bio::Tools::Blast qw(:obj);

  imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
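The interplay of @EXPORT, @EXPORT_OK and %EXPORT_TAGS can be sketched with a small module inlined in a single file. Everything named here (MyTools, revcomp, gc_count, $Default) is hypothetical; the explicit import call stands in for what use MyTools qw(:obj gc_count) would do for a module stored in its own file.

```perl
use strict;
use warnings;

# Hypothetical module, inlined in a BEGIN block so that the sketch
# fits in one file.
BEGIN {
    package MyTools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcomp');                  # exported by default
    our @EXPORT_OK   = ('gc_count', '$Default');     # on demand only
    our %EXPORT_TAGS = (obj => ['$Default']);        # named set of symbols
    our $Default     = 'shared object';

    sub revcomp {
        my $s = reverse shift;
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }
    sub gc_count {
        my $s = shift;
        return ($s =~ tr/GCgc//);
    }
}

# An explicit import list replaces the default export list.
BEGIN { MyTools->import(':obj', 'gc_count') }

print "tagged variable: $Default\n";
print "gc_count: ", gc_count('GGCA'), "\n";
print "revcomp imported: ", (main->can('revcomp') ? 'yes' : 'no'), "\n";
print "qualified call: ", MyTools::revcomp('AACG'), "\n";
```

Note that as soon as an import list is given, the symbols in @EXPORT are no longer imported automatically; they remain reachable through their fully qualified name.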

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to non-local variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
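The same mechanism can be tried on a toy class that stores records in an array instead of reading sequence files. This is a sketch under that assumption: LineStore and its newFh constructor are invented for the illustration and only imitate the shape of the Bio::SeqIO code shown above.

```perl
use strict;
use warnings;
use Symbol ();

package LineStore;     # hypothetical class: a tied handle over an array

# Builds an anonymous glob and ties it to this class.
sub newFh {
    my ($class, @records) = @_;
    my $fh = Symbol::gensym();
    tie *$fh, $class, @records;
    return $fh;
}

sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}

# Called by <$fh>: one record in scalar context, all in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{records} } unless wantarray;
    my @all = @{ $self->{records} };
    $self->{records} = [];
    return @all;
}

# Called by print $fh ...: appends the printed items to the store.
sub PRINT {
    my $self = shift;
    push @{ $self->{records} }, @_;
    return 1;
}

package main;

my $fh = LineStore->newFh('seq1', 'seq2');
print $fh 'seq3';
my $first = <$fh>;
my @rest  = <$fh>;
print "first=$first rest=@rest\n";
```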

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh(-file   => $infile,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-file   => ">$outfile",
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh(-fh     => \*STDIN,
                           -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;    # false
    }

    return 1;        # true
}

sub usage {

    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ($annotation->each_DBLink()) {
    if ($link->database() eq 'PDB') {
        push(@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function -- exo
my $function = "";
foreach my $comment ($annotation->get_Annotations('comment')) {
    my @lines = split(/\n/, $comment->text());
    foreach my $line (@lines) {
        if ($line =~ s/^-!- FUNCTION: //) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost the same for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ($#ARGV > -1) {
    $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
} else {
    $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
}
my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);
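The column arithmetic of this solution can be checked on plain strings, without Bio::AlignIO. The sketch below is an illustration only: trim_columns is a hypothetical helper, the three rows are made-up data, and substr takes over the role that slice plays on a real alignment object.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the leading and trailing columns in
# which at least one row still has a gap. All rows must have the
# same length, as in an alignment.
sub trim_columns {
    my ($gap, @rows) = @_;
    my ($lead, $trail) = (0, 0);
    foreach my $row (@rows) {
        my ($l) = $row =~ /^(\Q$gap\E*)/;     # gaps at the start
        my ($t) = $row =~ /(\Q$gap\E*)$/;     # gaps at the end
        $lead  = length($l) if length($l) > $lead;
        $trail = length($t) if length($t) > $trail;
    }
    my $width = length($rows[0]) - $lead - $trail;
    return map { substr($_, $lead, $width) } @rows;
}

my @trimmed = trim_columns('-', '--ACGT-A', 'TTACGTAA', '-TAC-T--');
print "$_\n" foreach @trimmed;
```

Gaps inside the kept block, such as the one in the third row, are left untouched: only the terminal columns are cut.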

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while (my @rids = $factory->each_rid) {

    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid (@rids) {

        my $rc = $factory->retrieve_blast($rid);
        if (! ref($rc)) {
            if ($rc < 0) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);

            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while (my $hit = $result->next_hit) {
                print "hit name is: ", $hit->name, "\n";
                while (my $hsp = $hit->next_hsp) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# extract the databank name: the word preceding the first | in the hit name
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;    # databanks already displayed
while (my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new(-file   => $ARGV[0],
                              -format => 'fasta');
push(@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new(-file   => $ARGV[1],
                              -format => 'fasta');
push(@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                             -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                        -id    => $seq->display_id,
                                        -start => 1,
                                        -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                     '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    my $i = 0;
    while (my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                               -id    => $name,
                                               -start => $start,
                                               -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit (@{$hitarray_ref}) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        # hits still present in this round have not disappeared
        if (grep /\Q$hit\E/, @{$oldhitarray_ref}) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @{$oldhitarray_ref});
    push (@previous_hitarray, @{$hitarray_ref});
}


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#!/local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#!/usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#!/usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#!/local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                               -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                   -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ($start > $end) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                      -id      => $ids[1],
                                      -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # database name => sequence format of its entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # entries are given as "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                             -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # closing the pipe sets $? to the exit status of golden
    $in->DESTROY();
    if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

    $self->throw("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;    # a module file must return a true value



Chapter 5. Databases

5.1. Database classes

Example 5.1. Database class use

use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename => $Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}

Example 5.2. Database Index creation

(data: a SwissProt file to index [data/small_swiss.dat])

use Bio::Index::Swissprot;

my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new(-filename   => $Index_File_Name,
                                     -write_flag => 1);
$inx->make_index(@ARGV);


Figure 5.1. Database Classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this EMBL formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id($ARGV[0]);
my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => 'fasta');

print $out $seq;

Solution A.29


Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

$blastObj = Bio::Tools::Blast->new(-run       => \%runParam,
                                   -parse     => 1,
                                   -filt_func => \&my_filter);

• for an output parameter:

%blastParam = (-run        => \%runParam,
               -save_array => \@blast_objs);

• to pass a filehandle as a parameter:

my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

%runParam = (-method   => 'remote',
             -prog     => 'blastp',
             -database => 'swissprot',
             -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

# (reminder) a plain hash
%a = (a => 1, b => 2);
print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

# reference on an anonymous hash
$ra = { a => 1, b => 2 };
print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

# reference on an anonymous array
$rl = [1, 2];
print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance Exercises 2.4 and 2.7. ref() returns the type of the referenced variable (or its class, see below).
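The three kinds of reference listed above (hash, function, output array) can be seen together in a small stand-alone sketch. It uses core Perl only; run_filter and the score table are hypothetical names invented for the illustration, not part of bioperl.

```perl
use strict;
use warnings;

# A function receiving a hash reference, a code reference and an
# array reference used as an output parameter.
sub run_filter {
    my ($params, $filter, $kept) = @_;
    foreach my $name (sort keys %{$params}) {
        push @{$kept}, $name if $filter->($params->{$name});
    }
    return scalar @{$kept};
}

my %scores = (hitA => 12, hitB => 250, hitC => 87);   # made-up data
my @kept;
my $count = run_filter(\%scores, sub { $_[0] > 50 }, \@kept);

print "kept $count: @kept\n";
print ref(\%scores), " ", ref(\@kept), " ", ref(sub { 1 }), "\n";
```

The caller's @kept array is filled in by the function, which is the same mechanism as the -save_array parameter shown in the list.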

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in Perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

open (INFILE, $infile) || die "cannot open $infile: $!";
open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
$line = <INFILE>;             # read a line
print OUTFILE $sequence;      # write a sequence

• In bioperl, there are classes that build filehandles:

$in = Bio::SeqIO->newFh(-file => 'f');       # make a tied filehandle
$sequence = <$in>;                           # read a sequence object
$out = Bio::SeqIO->newFh(-file => '> f');
print $out $sequence;                        # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
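As a minimal illustration of passing filehandles through typeglob references, the sketch below hands two bareword filehandles to a subroutine. It assumes a Perl recent enough to open in-memory files (5.8 or later), which keeps the example self-contained; copy_upper and the sample data are made up for the occasion.

```perl
use strict;
use warnings;

# Write every line of an input stream to an output stream in upper
# case. Both streams arrive as parameters (typeglob references and
# lexical filehandles work the same way here).
sub copy_upper {
    my ($in, $out) = @_;
    my $n = 0;
    while (my $line = <$in>) {
        print $out uc($line);
        $n++;
    }
    return $n;
}

# In-memory files keep the example self-contained.
my $input  = "acgt\nttga\n";
my $output = '';
open(INFILE,  '<', \$input)  || die "cannot open input: $!";
open(OUTFILE, '>', \$output) || die "cannot open output: $!";

my $lines = copy_upper(\*INFILE, \*OUTFILE);
close(OUTFILE);
close(INFILE);

print "copied $lines lines\n";
print $output;
```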

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------

bioperl indeed uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throw {
    my ($self, $string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n" .
              "MSG: " . $string . "\n" . $std .
              "-------------------------------------------\n";
    die $out;
}

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

eval {
    # this instruction may throw an exception if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq);
};
if ($@) {
    print "Caught exception";
} else {
    print "no exception";
}
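The eval / $@ pattern can be tried without bioperl. In the sketch below, checked_length is a hypothetical stand-in for a method that throws when its input looks wrong; only die and eval from core Perl are used.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a method that throws on bad input.
sub checked_length {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n"
        unless $seq =~ /^[A-Za-z]+$/;
    return length($seq);
}

my @report;
foreach my $seq ('ACGT', '==') {
    my $len = eval { checked_length($seq) };   # the ; ends the eval statement
    if ($@) {
        push @report, "caught: $@";
    } else {
        push @report, "length $len\n";
    }
}
print @report;
```

The script goes on after the failing call, which is the point of catching the exception rather than letting die end the program.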

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

sub seq {
    my ($self) = @_;
    if ($self->can('throw')) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
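A reduced version of this idea, with invented class names (My::SeqI, My::Seq, My::Broken) and a plain die in place of throw, shows what happens when a subclass forgets to provide the method:

```perl
use strict;
use warnings;

# Interface class: seq() only complains, so subclasses must override it.
package My::SeqI;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . " did not provide a seq method\n";
}

# A class that implements the interface.
package My::Seq;
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

# A class that forgets to implement it.
package My::Broken;
our @ISA = ('My::SeqI');

package main;

my $good = My::Seq->new(seq => 'ACGT')->seq;
my $bad  = eval { My::Broken->new->seq };
my $err  = $@;
print "good: $good\n";
print "error: $err";
```

The error only shows up when the missing method is called, not when the class is defined; this is the price of encoding abstract methods at run time.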

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

use Getopt::Std;
my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

• Another example:

use Getopt::Std;
getopts('f:plgh');
if ($opt_f) { $format = $opt_f; }
else { $format = "Genbank"; }
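To watch getopts at work without typing a command line, @ARGV can be filled by hand before the call. The sketch below does so with made-up arguments; a letter followed by a colon in the option string takes a value, and whatever is not an option stays in @ARGV.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate a command line so the example runs without arguments;
# in a real script @ARGV is filled in by the shell.
@ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'leftover');

my %opts = ();
getopts('i:o:I:O:', \%opts) or die "bad options\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

print "$inform -> $outform, $infile -> $outfile\n";
print "remaining arguments: @ARGV\n";
```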

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

@ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

if (Bio::Tools::Blast->isa("Bio::Root::RootI"))

if ($seq->isa("Bio::Seq::RichSeq"))

• find the class of an object:

print "class of exon: ", ref($exon), "\n";
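These points (inheritance through @ISA, instantiation with new, isa tests, ref) fit in a few lines of core Perl. The classes My::Feature and My::Exon below are invented for the illustration and only mimic the shape of bioperl feature classes.

```perl
use strict;
use warnings;

package My::Feature;
sub new {
    my ($class, %args) = @_;
    return bless { start => $args{-start}, end => $args{-end} }, $class;
}
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1; }

package My::Exon;
our @ISA = ('My::Feature');      # inheritance
sub primary_tag { return 'Exon'; }

package main;

my $exon = My::Exon->new(-start => 100, -end => 171);
print "class of exon: ", ref($exon), "\n";
print "is a feature\n" if $exon->isa('My::Feature');   # test on an object
print "class check\n"  if My::Exon->isa('My::Feature'); # test on a class
print "span: ", $exon->span, "\n";
```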

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement); see for instance Bio::LocationI, to define the default coordinate policy:

BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
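The order of execution is easy to check with a package variable that records who ran first. This is a minimal sketch in core Perl; the @order array is only there for the demonstration.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';

# Compiled and executed as soon as it is parsed, before any ordinary
# statement of the file runs, even though it appears later in the file.
BEGIN { push @order, 'BEGIN block'; }

print join(', ', @order), "\n";
```

The BEGIN entry comes first in the list although the block is written after the ordinary push statement.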

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
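The three Exporter variables can be tried in a single file. In the sketch below the module My::Tools, its functions and its :stats tag are all invented; the module sits in a BEGIN block only so that no separate .pm file is needed, and the import call stands for the usual use statement.

```perl
use strict;
use warnings;

# The module is defined in a BEGIN block only so that the example fits
# in one file; normally it would live in its own My/Tools.pm.
BEGIN {
    package My::Tools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcomp');                # exported by default
    our @EXPORT_OK   = ('gc_count', '$Alphabet');  # imported on demand
    our %EXPORT_TAGS = (stats => ['gc_count']);    # named set of symbols
    our $Alphabet    = 'ACGT';

    sub revcomp {
        my ($s) = @_;
        (my $r = reverse $s) =~ tr/ACGT/TGCA/;
        return $r;
    }
    sub gc_count { my ($s) = @_; return scalar($s =~ tr/GC//); }
}

# Equivalent of "use My::Tools qw(:DEFAULT :stats $Alphabet);"
BEGIN { My::Tools->import(':DEFAULT', ':stats', '$Alphabet'); }

my $rc = revcomp('AACG');
my $gc = gc_count('AACG');
print "$rc $gc $Alphabet\n";
```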

6.3.2. Compiler instructions

• use strict; no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type); a pragma to pre-declare global variables
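A short sketch of both pragmas follows. The content given to %valid_type is made up, and the string eval is only a device to compile an offending snippet at run time so that the strict error can be shown instead of stopping the script.

```perl
use strict;
use warnings;
use vars qw(%valid_type);       # pre-declared package (global) variable

%valid_type = map { $_ => 1 } qw(dna rna protein);

my $count = 0;                  # lexical variable, declared with my
$count++ for grep { $valid_type{$_} } qw(dna lipid protein);

# With strict in force, an undeclared variable is a compile-time error.
# eval on a string compiles the snippet at run time, so the error can
# be caught and inspected.
my $ok  = eval 'use strict; $undeclared = 1; 1';
my $err = $@;

print "valid types seen: $count\n";
print "strict rejected the snippet\n" if !$ok && $err =~ /Global symbol/;
```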

6.3.3. Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s, $class, $self;
    return $s;
}

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

sub READLINE {
    my $self = shift;
    return $self->{'seqio'}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{'seqio'}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{'seqio'}->write_seq(@_);
}

These methods enable the programmer to read and print through the variable:

$fh = $obj->fh;           # make a tied filehandle
$sequence = <$fh>;        # read a sequence object (calls READLINE)
print $fh $sequence;      # write a sequence object (calls PRINT)
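The same mechanism can be reproduced with core Perl alone. The sketch below uses two invented classes, My::Stream (a queue of records) and My::Stream::Fh (the tie class), and ties the glob with tie *$s, the form documented in perltie; it illustrates the technique and is not a copy of the Bio::SeqIO code.

```perl
use strict;
use warnings;
use Symbol ();

# Minimal stream class: hands out queued records and collects written ones.
package My::Stream;
sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my ($self) = @_; return shift @{ $self->{in} }; }
sub write_rec { my ($self, @recs) = @_; push @{ $self->{out} }, @recs; return 1; }

# Tie a fresh glob to the handle class and return it as a filehandle.
sub fh {
    my ($self) = @_;
    my $s = Symbol::gensym();
    tie *$s, 'My::Stream::Fh', $self;
    return $s;
}

package My::Stream::Fh;
sub TIEHANDLE { my ($class, $stream) = @_; return bless { stream => $stream }, $class; }
sub READLINE {
    my ($self) = @_;
    return $self->{stream}->next_rec() unless wantarray;
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_rec());
    return @list;
}
sub PRINT { my ($self, @args) = @_; return $self->{stream}->write_rec(@args); }

package main;

my $stream = My::Stream->new('seq1', 'seq2', 'seq3');
my $fh     = $stream->fh;
my $first  = <$fh>;            # calls READLINE in scalar context
my @rest   = <$fh>;            # calls READLINE in list context
print $fh $first, @rest;       # calls PRINT
print "first=$first rest=@rest written=@{ $stream->{out} }\n";
```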


Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#!/local/bin/perl -w

# exercise 11: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile) {
    # derive the output name from the input name and the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in  = Bio::SeqIO->newFh(-file => $infile, -format => $inform);

my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

#!/local/bin/perl -w

# exercise 11: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in  = Bio::SeqIO->newFh(-fh => \*STDIN, -format => $inform);

my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                            -format => $outform);

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#!/local/bin/perl -w

# exercise 11: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in  = Bio::SeqIO->newFh(-file => $infile, -format => $inform);

my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

#!/local/bin/perl -w

# exercise 11: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in  = Bio::SeqIO->newFh(-fh => \*STDIN, -format => $inform);

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0;    # false
    }

    return 1;    # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "    -i informat  -- infile format (default 'swiss')\n";
    print "    -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "    -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ($annotation->each_DBLink()) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function -- exo
my $function = '';
foreach my $comment ($annotation->get_Annotations('comment')) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ($line =~ s/^-!- FUNCTION: //) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne '';
}
print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#!/local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end
# of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ($#ARGV > -1) {
    $in = new Bio::AlignIO (-file => $ARGV[0], -format => 'clustalw');
} else {
    $in = new Bio::AlignIO (-fh => \*STDIN, -format => 'clustalw');
}
my $out = newFh Bio::AlignIO (-fh => \*STDOUT, -format => 'clustalw');

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
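The column arithmetic of this solution can be checked on plain strings, without Bio::AlignIO. The sketch below uses three made-up aligned rows and substr in place of the slice method; only the computation of the leading and trailing gap blocks is taken from the solution.

```perl
use strict;
use warnings;

# Made-up aligned rows, all of the same length.
my @rows = ('--ACGTAC--', '-TACGTACG-', '---CGTACGT');
my $gap  = '-';

# Longest run of leading and of trailing gaps over all rows.
my ($startgaps, $endgaps) = (0, 0);
foreach my $str (@rows) {
    my ($lead)  = $str =~ /^(\Q$gap\E*)/;
    my ($trail) = $str =~ /(\Q$gap\E*)$/;
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns between the two gap blocks.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;
print "$_\n" for @trimmed;
```

substr counts columns from 0 whereas slice counts from 1, hence $startgaps here against $startgaps+1 in the solution.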

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while (my @rids = $factory->each_rid) {

    print STDERR "waiting\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid (@rids) {

        my $rc = $factory->retrieve_blast($rid);
        if (! ref($rc)) {
            if ($rc < 0) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while (my $hit = $result->next_hit) {
                print "hit name is: ", $hit->name, "\n";
                while (my $hsp = $hit->next_hsp) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
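
The "best hit by databank" logic does not depend on bioperl itself: it only needs the databank prefix of each hit name and a hash that remembers which databanks were already seen. The sketch below shows that idea on plain strings. The hit names are invented for illustration and are assumed to follow a "databank|accession|entry" pattern, listed best hit first.

```perl
use strict;
use warnings;

# Hypothetical hit names ("databank|accession|entry"), best hit first.
my @hit_names = ('sp|A0001|FIRST_ECOLI', 'sp|A0002|SECOND_ECOLI', 'tr|B0001|THIRD_ECOLI');

# Return the databank prefix of a hit name, or undef if there is none.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

my (%done, @best);
foreach my $name (@hit_names) {
    my $db = dbname($name);
    next unless defined $db;
    next if $done{$db}++;       # a hit from this databank was already kept
    push @best, $name;
}
print "$_\n" for @best;
```

Because the hits are visited in report order, the first hit kept for each databank is also its best one.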

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);

            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
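
The gap-padding step of this solution can be tried without any Blast report. The following sketch is a stand-alone variant of fill_gaps that uses the repetition operator instead of loops; the sequence and the coordinates are made up for the example, and the padding rule (start is 1-based, the row is as wide as the alignment) is assumed from the solution above.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = '-' x ($start - 1);
    my $right = '-' x ($length - ($start - 1) - length($seq));
    return $left . $seq . $right;
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
print length($row), "\n";
```

Every padded row then has the same width as the contig, which is what lets the rows be displayed together as one alignment.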

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile , -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length
                       );
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

  • Bioperl course
    • Chapter 1. General introduction
      • 1.1. Bioperl documentation
      • 1.2. General bioperl classes
    • Chapter 2. Sequences
      • 2.1. The Bio::SeqIO class
      • 2.2. Sequence classes
      • 2.3. Features and Location classes
      • 2.4. Sequence analysis tools
    • Chapter 3. Alignments
      • 3.1. AlignIO
      • 3.2. SimpleAlign
      • 3.3. Code reading: protal2dna
    • Chapter 4. Analysis
      • 4.1. Blast
      • 4.2. Genscan
    • Chapter 5. Databases
      • 5.1. Database classes
      • 5.2. Accessing a local database with golden
    • Chapter 6. Perl Reminders
      • 6.1. UML
      • 6.2. Perl reminders to use bioperl modules
      • 6.3. Perl reminders for a further advanced understanding of bioperl modules
    • Appendix A. Solutions
      • A.1. Sequences
      • A.2. Alignments
      • A.3. Analysis
      • A.4. Databases
Page 53: Bio Perl

Chapter 5. Databases

Figure 5.1. Database classes structure

Exercise 5.1. Parse a Genscan report and build a database entry

Parse a Genscan report on a part of Arabidopsis chromosome V (data: Genscan report [data/genscan-arab.out]).

• (a) construct a database in EMBL format:

  • containing an entry for each Genscan predicted CDS

  • put the exons of this CDS as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
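
The kinds of references listed in this section can be checked with a few lines of core Perl. This is only an illustrative sketch: the parameter names and the my_filter subroutine are placeholders chosen for the example, not part of any bioperl interface.

```perl
use strict;
use warnings;

# A named hash and a reference to it.
my %run_param = (-prog => 'blastp', -database => 'swissprot');
my $hash_ref  = \%run_param;

# Anonymous structures: [] builds an array reference, {} a hash reference.
my $array_ref = [1, 2];
my $anon_hash = { a => 1, b => 2 };

# A reference to a named subroutine, usable as a callback.
sub my_filter { my ($score) = @_; return $score >= 50 }
my $code_ref = \&my_filter;

# ref() reports what kind of thing a reference points to.
print ref($hash_ref),  "\n";
print ref($array_ref), "\n";
print ref($code_ref),  "\n";

# Dereferencing with the arrow operator.
print $hash_ref->{-prog}, "\n";
print $array_ref->[1], " ", $anon_hash->{b}, "\n";
print $code_ref->(60) ? "kept\n" : "filtered out\n";
```

The same ref() call returns a class name instead of HASH or ARRAY once a reference has been blessed into a class.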

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;                 # read one line
    print OUTFILE $sequence;          # write the value of $sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '>f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
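
Passing a filehandle to a subroutine can be tried with core Perl alone. In this sketch, write_record is a made-up helper; it receives either a typeglob reference to a bareword handle or a lexical handle, and the second call writes into an in-memory string so the result can be inspected.

```perl
use strict;
use warnings;

# Write one fasta-like record to whatever filehandle the caller supplies.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# A bareword handle must be passed as a typeglob reference.
write_record(\*STDOUT, 'seq1', 'ACGT');

# A lexical handle is already a scalar; here it writes into a string.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, 'seq2', 'TTGA');
close($mem);
print $buffer;
```

Lexical handles such as $mem avoid the typeglob syntax altogether, but the \*STDOUT form is still needed for the predefined bareword handles.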

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

Bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
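
The die/eval pair described above can be exercised without bioperl. The sketch below uses two invented helpers: checked_seq raises an exception on a string that does not look like a sequence, and try_seq catches it with eval and inspects $@. The message text and the validation rule are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical validator: dies (raises an exception) on bad input.
sub checked_seq {
    my ($string) = @_;
    die "MSG: [$string] does not look like a sequence\n"
        unless $string =~ /^[A-Za-z*-]+$/;
    return uc $string;
}

# Run the validator and report whether it raised an exception.
sub try_seq {
    my ($string) = @_;
    my $result = eval { checked_seq($string) };
    if ($@) {
        return "caught: $@";
    }
    return "ok: $result";
}

print try_seq('acgt'), "\n";
print try_seq('==');
```

Since eval returns the value of its block, the caught and uncaught cases can be told apart either through $@, as here, or through the returned value.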

6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else { $format = "Genbank"; }
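
To see what getopts leaves in the hash and in @ARGV, the command line can be simulated inside the script. The argument values below are invented for the demonstration; the option letters follow the first example above, assuming each of them takes a value.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate "script.pl -i embl -O out.fasta seqs.embl" for the demonstration.
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'seqs.embl');

my %opts;
getopts('i:o:I:O:', \%opts)
    or die "usage: $0 [-i fmt] [-o fmt] [-I file] [-O file]\n";

# Options that were not given fall back to a default value.
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "input format:  $inform\n";
print "output format: $outform\n";
print "output file:   $outfile\n";
print "remaining arguments: @ARGV\n";
```

A colon after a letter in the specification string means that the option expects a value; letters without a colon are simple flags.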

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI"))

    if ($seq->isa("Bio::Seq::RichSeq"))

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
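
The inheritance and is-a tests above can be reproduced with two toy classes. Feature and Exon are hypothetical names used only in this sketch; they stand in for real bioperl classes, which declare their parents through @ISA in the same way.

```perl
use strict;
use warnings;

# Two hypothetical classes: Feature is the base class, Exon inherits from it.
package Feature;
sub new {
    my ($class, %args) = @_;
    my $self = { start => $args{-start}, end => $args{-end} };
    return bless $self, $class;
}
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1 }

package Exon;
our @ISA = ('Feature');     # inheritance is declared through @ISA

package main;

my $exon = Exon->new(-start => 10, -end => 19);

print "class of exon: ", ref($exon), "\n";
print "exon is a Feature\n"       if $exon->isa('Feature');
print "Exon class is a Feature\n" if Exon->isa('Feature');
print "span: ", $exon->span, "\n";
```

Exon defines no method of its own: both new and span are found in Feature by following @ISA.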

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
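
The moment at which a BEGIN block runs can be observed directly. In this small sketch (the variable name is arbitrary), the BEGIN block is written after the ordinary statement but is executed first, because it runs as soon as the compiler has parsed it.

```perl
use strict;
use warnings;

our @order;

# Runs at run time, in its normal position in the program flow.
push @order, 'main code';

# Runs as soon as the compiler has parsed it, before any run-time statement.
BEGIN { push @order, 'BEGIN block' }

print join(' -> ', @order), "\n";
```

This is why a module can rely on a BEGIN block to have its defaults in place before any of its methods is called.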

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj: here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
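
The three Exporter variables can be tried in a single file. SeqUtils and its two functions are invented for this sketch; the module is defined inline, so import is called explicitly where a separate SeqUtils.pm file would be loaded with a use statement.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so the example fits in one file.
package SeqUtils;
use Exporter;
our @ISA         = ('Exporter');
our @EXPORT      = ('gc_count');                 # exported by default
our @EXPORT_OK   = ('revcomp');                  # imported on demand
our %EXPORT_TAGS = (all => ['gc_count', 'revcomp']);

# Number of G and C letters in a sequence string.
sub gc_count { my ($s) = @_; return scalar(() = $s =~ /[GCgc]/g) }

# Reverse complement of a DNA string.
sub revcomp  { my ($s) = @_; (my $r = reverse $s) =~ tr/ACGTacgt/TGCAtgca/; return $r }

package main;

# For a module stored in SeqUtils.pm this would be: use SeqUtils qw(:all);
SeqUtils->import(':all');

print gc_count('ACGGT'), "\n";
print revcomp('AACG'), "\n";
```

Asking for the :all tag brings in both functions; a plain import without arguments would bring in only gc_count, the default export.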

6.3.2. Compiler instructions

• use strict;: no symbolic references, no use of undeclared variables (local variables are declared with a my statement; global variables must be pre-declared or fully qualified), no barewords.

• use vars qw(@ISA %valid_type);: a pragma to pre-declare global variables.
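
The effect of these two pragmas can be checked in a few lines. The hash name below is borrowed from the bullet above and its content is invented for the example; the eval block only serves to catch the run-time error that strict raises for a symbolic reference.

```perl
use strict;
use warnings;
use vars qw(%valid_type);      # pre-declared package (global) variable

%valid_type = map { $_ => 1 } qw(dna rna protein);

# Under "use strict", using a string as a variable name is a run-time error.
my $name = 'valid_type';
my $ok = eval { my %copy = %$name; 1 };
print $ok ? "symbolic reference allowed\n" : "symbolic reference refused\n";

print "dna is valid\n" if $valid_type{dna};
```

Without the use vars line, the assignment to %valid_type would be rejected at compile time, since the variable is neither declared with my nor fully qualified.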

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
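
The tying mechanism can be tried on its own with a toy class. RecordStream is a hypothetical class written for this sketch: reading from the tied handle returns items from a list, and printing to it stores the printed values, which makes the calls to READLINE and PRINT visible.

```perl
use strict;
use warnings;

# Hypothetical class: a tied filehandle that hands out records from a list.
package RecordStream;
sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub READLINE {                      # called by <$fh>
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;
    return splice @{ $self->{in} };
}
sub PRINT {                         # called by print $fh ...
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;
use Symbol;

my $fh  = gensym;                                    # anonymous typeglob reference
my $obj = tie *$fh, 'RecordStream', 'seq1', 'seq2';  # tie returns the underlying object

my $first = <$fh>;          # calls READLINE
print $fh "written";        # calls PRINT
print "read: $first\n";
print "stored: @{ $obj->{out} }\n";
```

In list context the same READLINE returns all the remaining records at once, which mirrors the wantarray test in the Bio::SeqIO method shown above.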

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile ) {
        # default output name: input name with its extension replaced by the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {

        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }

        return 1;        # true
    }

    sub usage {

        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {

        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
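
The search for the longest leading and trailing gap runs can be followed on plain strings. The three rows below are made up for the example and are assumed to have the same length, as the rows of an alignment do; substr then plays the role of the slice method.

```perl
use strict;
use warnings;

# Hypothetical aligned rows (same length); '-' is the gap character.
my @rows = ('--ACGTAC-', '-TACGTACG', '---CGTA--');
my $gap  = '-';

my ($startgaps, $endgaps) = (0, 0);
foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(\Q$gap\E*)/;   # run of gaps at the start
    my ($trail) = $row =~ /(\Q$gap\E*)$/;   # run of gaps at the end
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no row starts or ends with a gap.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;
print "$_\n" for @trimmed;
```

Gaps inside a row are left untouched by this approach: only the leading and trailing blocks of columns are removed.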

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |' , -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'  => 'blastp',
                      'database' => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    # a module must return a true value
    1;

Page 54: Bio Perl

Chapter 5 Databases

•
  • containing an entry for each Genscan predicted CDS
  • put this CDS's exons as Features in the current entry

• (b) index this EMBL-formatted database

• (c) use your indexed database

Outline of the exercise

Solution A.27

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence, part of chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing the whole CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2. Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may help you use the modules [use] or gain a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• reference to an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

  which is equivalent to the following, with a named array variable:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

• references to anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference to an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference to an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
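The reference forms listed above can be tried without bioperl. The following is a minimal sketch using only core Perl; the helper count_params and the variable names are illustrative and not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical helper: receives a hash reference and a code reference,
# and counts the keys accepted by the filter.
sub count_params {
    my ($params, $filter) = @_;
    return scalar grep { $filter->($_) } keys %$params;
}

my %run_param = ( -method => 'remote', -prog => 'blastp' );
my @seqs      = ('seq1', 'seq2');

# Reference to a named hash, and a reference to an anonymous function.
my $n = count_params(\%run_param, sub { $_[0] =~ /^-/ });

# Anonymous array (a copy of @seqs) and anonymous hash.
my $rl = [ @seqs ];
my $ra = { a => 1, b => 2 };

print "matching keys: $n\n";
print ref(\%run_param), " ", ref($rl), " ", ref($ra), "\n";
print "$rl->[0] $ra->{b}\n";
```

Note that ref() applied to a plain reference returns the kind of data it points to (HASH, ARRAY, CODE), while for an object it returns the class name.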

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in Perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read one line
    print OUTFILE $sequence;     # write a sequence

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
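As a small illustration of passing filehandles around, here is a sketch in core Perl only. The helper write_line and the handle name OUT are made up for the example; the in-memory filehandle assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it is given.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_line(\*STDOUT, "to standard output");

# The same helper works with a lexical, in-memory filehandle.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($mem, "to a buffer");
close($mem);

# A typeglob assignment creates an alias: OUT now names the same
# stream as STDOUT.
*OUT = *STDOUT;
print OUT "buffer holds: $buffer";
```

The helper does not need to know whether it received a typeglob reference or a lexical handle, which is why bioperl constructors can accept -fh => \*STDOUT.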

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses Perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std.
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
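The two ideas above (catching an exception with eval, and using an exception to mark a method as abstract) can be sketched with core Perl alone. The classes SeqInterface, GoodSeq and BadSeq below are hypothetical stand-ins and plain die replaces bioperl's throw.

```perl
use strict;
use warnings;

# Hypothetical interface class: seq() must be provided by subclasses.
package SeqInterface;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide seq()\n";
}

# This subclass implements the abstract method.
package GoodSeq;
our @ISA = ('SeqInterface');
sub seq { return $_[0]->{seq}; }

# This subclass forgets to implement it.
package BadSeq;
our @ISA = ('SeqInterface');

package main;

my @messages;
for my $obj (GoodSeq->new(seq => 'acgt'), BadSeq->new()) {
    # The eval block ends with a semicolon; $@ holds the error, if any.
    my $value = eval { $obj->seq() };
    if ($@) {
        push @messages, "caught: $@";
    } else {
        push @messages, "no exception: $value\n";
    }
}
print @messages;
```

Calling seq() on the BadSeq object falls back to the interface method, which raises the exception; the program keeps running because the call is wrapped in eval.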


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; }
    else { $format = "Genbank"; }
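To see what getopts leaves in the hash and in @ARGV, here is a short self-contained sketch. The command line is simulated by assigning @ARGV inside the script, which is an assumption made only so that the example runs without arguments.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (normally @ARGV is filled by the shell).
local @ARGV = ('-i', 'embl', '-h', 'seqs.embl');

my %opts;
# 'i:' and 'o:' take a value; 'h' is a boolean flag.
getopts('hi:o:', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $help    = exists $opts{h} ? 1 : 0;

# Option processing stops at the first non-option argument,
# which stays in @ARGV.
print "in=$inform out=$outform help=$help rest=@ARGV\n";
```

A letter followed by a colon in the specification string expects a value; letters without a colon are simple switches.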

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

    if ($seq->isa("Bio::Seq::RichSeq")) ...

• find the class of an object:

    print "class of exon: ", ref($exon), "\n";
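The is-a tests and ref() can be tried on a toy hierarchy. In this sketch the classes PlainSeq and AnnotatedSeq are hypothetical stand-ins for classes such as Bio::Seq and Bio::Seq::RichSeq.

```perl
use strict;
use warnings;

# Hypothetical base class.
package PlainSeq;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }

# Hypothetical subclass: inheritance is declared through @ISA.
package AnnotatedSeq;
our @ISA = ('PlainSeq');

package main;

my $seq = AnnotatedSeq->new(id => 'TAUD_ECOLI');

my $class       = ref($seq);                               # class of the object
my $is_plain    = $seq->isa('PlainSeq')          ? 1 : 0;  # object is-a test
my $class_check = AnnotatedSeq->isa('PlainSeq')  ? 1 : 0;  # class is-a test
my $reverse     = PlainSeq->isa('AnnotatedSeq')  ? 1 : 0;  # not the other way round

print "class=$class object-isa=$is_plain class-isa=$class_check reverse=$reverse\n";
```

The constructor new is inherited from PlainSeq, but because it blesses into the class it was called on, the object's class is AnnotatedSeq.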

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
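A tiny sketch in core Perl shows when a BEGIN block runs relative to ordinary statements; the array name @order is illustrative.

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';          # executed when the program runs

BEGIN {
    push @order, 'BEGIN block';   # executed as soon as it has been compiled
}

print join(' -> ', @order), "\n";
```

Although the BEGIN block appears second in the file, its push happens first, because the block is executed during compilation, before any run-time statement.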

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
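The three Exporter variables can be exercised with a module defined inline. This is a sketch only: the module My::SeqUtils, its functions and the :all tag are invented for the example, and the %INC assignment merely lets use find the inline package without a file.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the sketch is self-contained.
BEGIN {
    package My::SeqUtils;
    use Exporter ();
    our @ISA         = qw(Exporter);
    our @EXPORT      = qw(gc_count);               # exported by default
    our @EXPORT_OK   = qw(at_count $DEFAULT);      # imported on demand
    our %EXPORT_TAGS = ( all => [ qw(gc_count at_count $DEFAULT) ] );
    our $DEFAULT     = 'acgt';

    sub gc_count { my ($s) = @_; return ($s =~ tr/GCgc//); }
    sub at_count { my ($s) = @_; return ($s =~ tr/ATat//); }

    $INC{'My/SeqUtils.pm'} = 1;   # pretend the file has been loaded
}

# Import every symbol listed under the 'all' tag.
use My::SeqUtils qw(:all);

my $gc = gc_count($DEFAULT);
my $at = at_count($DEFAULT);
print "gc=$gc at=$at\n";
```

With a plain use My::SeqUtils; only gc_count would be imported; at_count and $DEFAULT have to be requested by name or through the tag.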

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to undeclared non-local variables (local variables are declared with a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s,$class,$self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
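The same mechanism can be reproduced in a few lines of core Perl. The class ListHandle below is a hypothetical stand-in for Bio::SeqIO: it serves items from a list on reading and records whatever is printed to it.

```perl
use strict;
use warnings;
use Symbol ();

# Hypothetical class implementing a tied filehandle.
package ListHandle;

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { items => [@items], written => [] }, $class;
}

# Called by <$fh>: one item in scalar context, all remaining items in list context.
sub READLINE {
    my $self = shift;
    return shift @{ $self->{items} } unless wantarray;
    my @rest = @{ $self->{items} };
    $self->{items} = [];
    return @rest;
}

# Called by print {$fh} ...
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'ListHandle', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;            # calls READLINE in scalar context
my @rest  = <$fh>;            # calls READLINE in list context
print {$fh} $first, @rest;    # calls PRINT

print "read first: $first\n";
print "read rest: @rest\n";
print "written: @{ $obj->{written} }\n";
```

As in Bio::SeqIO, READLINE uses wantarray to decide whether to return one item or everything that is left.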

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name: replace the extension
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in = Bio::SeqIO->newFh ( -file   => $infile,
                                 -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() so that a missing option does not warn
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
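The format check in this last version is a plain hash lookup, which can be isolated from Bio::SeqIO. The sketch below assumes Perl 5.10 or later (for the // operator); the helper name normalize_format is hypothetical, and the format list is simply the one from the solution above.

```perl
use strict;
use warnings;

# Set of accepted format names, taken from the solution above.
my %FORMATS = map { $_ => 1 }
    qw(fasta swiss embl genbank scf pir gcg raw ace bsml fastq phd qual);

# Hypothetical helper: lower-case the requested format, fall back on a
# default when none was given, and return undef for unknown formats.
sub normalize_format {
    my ($name, $default) = @_;
    my $format = lc($name // $default);
    return exists $FORMATS{$format} ? $format : undef;
}

my @checked = map { normalize_format($_, 'swiss') // 'unknown' } ('GCG', undef, 'msf');
print "@checked\n";
```

Returning undef instead of exiting leaves the decision (print a usage message, stop, or continue) to the caller.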

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = '';
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne '';
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
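The column arithmetic of this solution can be checked on plain strings, without Bio::SimpleAlign. This is a minimal sketch under stated assumptions: the helper name ungapped_window and the three aligned rows are made up, and the rows are assumed to all have the same length.

```perl
use strict;
use warnings;

# Hypothetical helper: given the gap character and the aligned rows,
# return the 1-based (first, last) columns that remain once the widest
# leading and trailing gap blocks are dropped.
sub ungapped_window {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^((?:\Q$gap_char\E)*)/;
        my ($trail) = $row =~ /((?:\Q$gap_char\E)*)$/;
        $startgaps = length $lead  if length $lead  > $startgaps;
        $endgaps   = length $trail if length $trail > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

# Made-up alignment rows of equal length.
my @rows = ('--ACGT-A--', 'AAACGTTA-C', '-AACGTTAGC');
my ($from, $to) = ungapped_window('-', @rows);
my @cut = map { substr($_, $from - 1, $to - $from + 1) } @rows;

print "keep columns $from..$to\n";
print "$_\n" for @cut;
```

The two values returned play the role of the arguments passed to slice in the solution: internal gaps are kept, only the flanking gap blocks are removed.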

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (golden: ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else, replace the two statements above with:
    #
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                    '-data' => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # despite its name, $max_length is used as a lower bound on the hit length
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank prefix of a hit name: the word before the first '|'
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
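The "first hit per databank" logic does not depend on Blast objects and can be tried on a list of names. In this sketch the hit names are invented for illustration, and the fallback to the whole name when there is no '|' is an assumption that the solution does not make.

```perl
use strict;
use warnings;

# Databank prefix of a hit name: the word before the first '|'.
# Assumption of this sketch: names without '|' are their own databank.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

# Invented hit names, already sorted from best to worst as in a Blast report.
my @hits = ('sp|X00001|FAKE1_ECOLI', 'sp|X00002|FAKE2_ECOLI',
            'pir|Y00001', 'gb|Z00001|', 'pir|Y00002');

my (%done, @best);
for my $hit (@hits) {
    next if $done{ dbname($hit) }++;   # a hit of this databank was already kept
    push @best, $hit;
}
print "$_\n" for @best;
```

Because the hits are read in report order, the first one seen for each databank is also the best one for that databank.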

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        # only the first (best) hit of each query is printed
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = '';
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= '-';
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= '-';
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
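The padding done by fill_gaps can be written more compactly with the repetition operator x. The sketch below is a stand-alone variant, not the course's code; the sequence, start position and alignment length are arbitrary sample values.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based) of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length $seq;
    $right = 0 if $right < 0;          # sequence runs past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);    # arbitrary sample values
print "$row\n";
print length($row), " columns\n";
```

Every row produced this way has the same width as the contig row, which is what allows the HSPs to be stacked under the contig in one alignment.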

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        # only the HSPs of the first subject are recorded
        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;     # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format of its entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        # entries are given as "database:identifier"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module file must end with a true value


Chapter 5. Databases

Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence

Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+/- delta at the beginning and at the end). Example:

    SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
         ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
         gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120

(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])

• build an additional feature describing all the CDS

• put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html#POD14]), which yields for instance:

    FT   CDS             join(complement(100..171),complement(284..358),
    FT                   complement(457..492),complement(626..673),
    FT                   complement(1460..1511),complement(2473..2513),
    FT                   complement(2638..2716),complement(2813..2969),
    FT                   complement(3046..3178),complement(3263..3390),
    FT                   complement(3476..3635))
    FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
    FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
    FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
    FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
    FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
    FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"

• regarding the location of the feature, see Section 2.3.4 in this tutorial as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2]

Solution A.28

5.2 Accessing a local database with golden

Exercise 5.3. Build a small bioperl module (for the golden program)

Idea: we want a bioperl module for local database access. We could use bioperl indexes, except that this is not efficient for large databases (indexing time and index size). All these databases are already indexed by golden [ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden] in a very efficient way. The module could be designed to be used like this:

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1 UML

Figure 6.1. UML meanings

6.2 Perl reminders to use bioperl modules

6.2.1 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);     # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);       # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) a named hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " $a{a} $a{b} \n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " $ra->{a} $ra->{b} \n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " $rl->[0] $rl->[1] \n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
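The kinds of references listed above can be exercised together in a few lines of core Perl. This is only an illustrative sketch: the function describe and the parameter names are hypothetical and merely imitate the -key => value style used by bioperl constructors.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash reference and a code reference,
# in the same way Bio::Tools::Blast->new takes -run and -filt_func.
sub describe {
    my ($params, $filter) = @_;
    return join ',', grep { $filter->($_) } sort keys %$params;
}

my %run_param = (-method => 'remote', -prog => 'blastp');
my $picked    = describe(\%run_param, sub { $_[0] =~ /prog/ });

my $ra = { a => 1, b => 2 };     # reference on an anonymous hash
my $rl = [1, 2];                 # reference on an anonymous array

print ref(\%run_param), ' ', ref($ra), ' ', ref($rl), "\n";
print "$picked $ra->{a} $rl->[1]\n";
```

Passing \%run_param hands over the hash itself rather than a copy, so the called function could also modify it; this is what makes output parameters such as -save_array possible.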

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a line
    print OUTFILE $sequence;        # write a string

• In bioperl there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');       # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
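Passing a filehandle to a function works the same way outside bioperl. The sketch below uses a hypothetical helper, write_record, and an in-memory file so that it leaves nothing on disk; the second call passes STDOUT through a typeglob reference, as in the -fh parameter above.

```perl
use strict;
use warnings;

# Hypothetical helper that writes to whatever filehandle it receives.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} $text, "\n";
}

# A lexical filehandle opened on an in-memory buffer.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($mem, 'to a lexical handle');
close $mem;

# A bareword filehandle has to be passed as a typeglob reference.
write_record(\*STDOUT, 'to STDOUT through a typeglob reference');
print $buffer;
```

Lexical filehandles (my $mem) are ordinary scalars and can be passed directly, which is why the typeglob reference is only needed for barewords such as STDIN, STDOUT or INFILE.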

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self,$string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n".
                  "MSG: ".$string."\n".$std."-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
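Both ideas, catching an exception with eval and using a throwing method as an abstract method, can be tried without bioperl. In this sketch all class names (SeqLike, PlainSeq, LazySeq) are hypothetical, and the throw method is a much simplified stand-in for the one of Bio::Root::RootI (no stack trace).

```perl
use strict;
use warnings;

package SeqLike;                      # hypothetical interface class
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub throw {                           # simplified: message only, no stack trace
    my ($self, $msg) = @_;
    die "-------- EXCEPTION --------\nMSG: $msg\n";
}
sub seq {                             # "abstract" method: must be overridden
    my ($self) = @_;
    $self->throw(ref($self) . " did not provide a seq method");
}

package PlainSeq;                     # implements seq
our @ISA = ('SeqLike');
sub seq { my ($self) = @_; return $self->{seq} }

package LazySeq;                      # forgets to implement seq
our @ISA = ('SeqLike');

package main;

my @report;
for my $obj (PlainSeq->new(seq => 'ACGT'), LazySeq->new) {
    my $seq = eval { $obj->seq };     # the block may throw
    if ($@) {
        push @report, 'caught: ' . (split /\n/, $@)[1];
    } else {
        push @report, "ok: $seq";
    }
}
print "$_\n" for @report;
```

The program keeps running after the failed call: eval turns the die into a value of $@ that the caller can inspect, report or ignore.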

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
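To see what getopts does with a command line, it is possible to fill @ARGV by hand. The sketch below does this with a made-up command line; the option letters are those of the first example above.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (made up), so that the sketch runs without arguments.
local @ARGV = ('-i', 'embl', '-O', 'seqs.fasta', 'extra.txt');

my %opts;
getopts('i:o:I:O:', \%opts) or die "bad options\n";

# Each option letter followed by ':' takes a value; defaults fill the gaps.
my $infile  = $opts{I} || 'infile';
my $outfile = $opts{O} || 'outfile';
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';

print "$inform -> $outform, $infile -> $outfile, rest: @ARGV\n";
```

getopts removes the options it recognises from @ARGV, so whatever remains (here extra.txt) can still be read as ordinary arguments.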

625 Classes

bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]

ISA = qw(BioRootRootI BioSeqI)

bull In bioperl you use classes the following ways

bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )

$seq_stats = BioToolsSeqStats-gtnew ($seqobj)

bull As a library component

55

Chapter 6 Perl Reminders

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

    if ($seq->isa("Bio::Seq::RichSeq")) ...

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
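The isa and ref tests can be checked on a toy hierarchy. The sketch below uses two hypothetical classes, My::Feature and My::Exon, which are not bioperl classes; only the inheritance mechanism (@ISA) is the same.

```perl
use strict;
use warnings;

package My::Feature;                     # hypothetical base class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package My::Exon;                        # hypothetical subclass
our @ISA = ('My::Feature');

package main;

my $exon = My::Exon->new(start => 10, end => 42);

print "class of exon: ", ref($exon), "\n";                       # class name of the object
print "exon is a feature\n"    if $exon->isa('My::Feature');     # test on an object
print "class test works too\n" if My::Exon->isa('My::Feature');  # test on a class name
print "plain hash reference: ", ref({}), "\n";                   # HASH for unblessed data
```

For an unblessed reference, ref returns the type of the data (HASH, ARRAY, CODE...) rather than a class name.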

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN { $coord_policy = Bio::Location::WidestCoordPolicy->new(); }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
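The timing of a BEGIN block is easy to observe: it runs while the file is being compiled, hence before any ordinary statement, wherever it appears in the file. A minimal sketch (the array name @order is an assumption for the example):

```perl
use strict;
use warnings;

our @order;                                   # records the order of execution

push @order, 'first runtime statement';
BEGIN { push @order, 'BEGIN block'; }         # executed at compile time
push @order, 'second runtime statement';

print join(", ", @order), "\n";
```

Although the BEGIN block is written after the first push, its entry comes first in the list.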

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl)

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand


• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
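The three Exporter variables can be tried in a single file. In this sketch the module My::Tools and its functions are hypothetical; because the module is defined inline rather than in its own file, its setup is wrapped in BEGIN and import is called by hand, which is what a use My::Tools qw(:all) statement would do for a real module file.

```perl
use strict;
use warnings;

package My::Tools;                        # hypothetical module, defined inline
BEGIN {
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(revcomp);                     # exported by default
    our @EXPORT_OK   = qw(gc_count);                    # imported on demand
    our %EXPORT_TAGS = (all => [qw(revcomp gc_count)]); # a named set of symbols
}

# Reverse-complement a DNA string.
sub revcomp {
    my $s = scalar reverse $_[0];
    $s =~ tr/ACGTacgt/TGCAtgca/;
    return $s;
}

# Count the G and C residues of a DNA string.
sub gc_count {
    my $s = shift;
    return ($s =~ tr/GCgc//);
}

package main;
BEGIN { My::Tools->import(':all'); }      # same effect as: use My::Tools qw(:all);

print revcomp('AACG'), "\n";
print gc_count('AACG'), "\n";
```

Every symbol listed under a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses to export it.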

6.3.2. Compiler instructions

• use strict: no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie *$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;            # make a tied filehandle
    $sequence = <$fh>;         # read a sequence object (calls READLINE)
    print $fh $sequence;       # write a sequence object (calls PRINT)
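The same mechanism can be reproduced with a toy class. The sketch below is not the bioperl implementation: My::RecordHandle is a hypothetical class that serves records from an in-memory list and stores whatever is printed to it, which is enough to see when READLINE and PRINT are called.

```perl
use strict;
use warnings;
use Symbol;

# Hypothetical class: a tied handle backed by in-memory lists.
package My::RecordHandle;
sub TIEHANDLE {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub READLINE {
    my $self = shift;
    return shift @{ $self->{in} } unless wantarray;   # scalar context: next record
    my @all = @{ $self->{in} };                       # list context: all remaining records
    $self->{in} = [];
    return @all;
}
sub PRINT {
    my $self = shift;
    push @{ $self->{out} }, @_;                       # keep what was "written"
    return 1;
}

package main;

my $fh  = Symbol::gensym();                           # an anonymous glob
my $obj = tie *$fh, 'My::RecordHandle', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;            # calls READLINE in scalar context
print $fh $first;             # calls PRINT
my @rest  = <$fh>;            # calls READLINE in list context

print "first=$first rest=@rest written=@{ $obj->{out} }\n";
```

As in Bio::SeqIO, the caller only manipulates a filehandle; the objects behind it are produced and consumed by the tied class.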

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;
    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;

    if (! defined $outfile) {
        # derive the output name from the input name
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp
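The only non-bioperl step of this converter is the construction of the default output file name. It can be isolated and tested on its own; the sketch below assumes that the extension is everything from the first dot on, and the function name default_outfile is invented for the example.

```perl
use strict;
use warnings;

# Build the output file name: strip everything from the first dot on,
# then append the output format as the new extension.
sub default_outfile {
    my ($infile, $outform) = @_;
    my $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    return "$outfile.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";
print default_outfile('seqs', 'fasta'), "\n";
```

With this rule a path whose directory part contains a dot would be cut too early, so a real script may prefer the core File::Basename module.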

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() to avoid a warning on undefined values
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
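The format check of this solution relies on a hash used as a set. A variant can be written so that it is testable without printing or exiting; this is only a sketch, and the names %FORMATS and check_format are assumptions made for the example.

```perl
use strict;
use warnings;

# Format names taken from the solution above, stored as hash keys for fast lookup.
my %FORMATS = map { $_ => 1 }
    qw(fasta swiss embl genbank scf pir gcg raw ace bsml fastq phd qual);

# Return a normalised (lower-case) format name, or undef if it is unknown.
sub check_format {
    my ($form, $default) = @_;
    $form = $default unless defined $form && length $form;
    $form = lc $form;
    return exists $FORMATS{$form} ? $form : undef;
}

for my $try ('GenBank', undef, 'clustal') {
    my $ok = check_format($try, 'swiss');
    print defined $ok ? "ok: $ok\n" : "unknown format\n";
}
```

Returning undef instead of calling exit leaves the decision (usage message, default value, exception) to the caller.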

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # (almost the same for GenBank/EMBL entries)
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
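The gap computation of this solution does not depend on bioperl and can be tried on plain strings. The sketch below works on a list of aligned strings instead of a Bio::SimpleAlign object; the function names gap_margins and trim_rows are invented for the example, and substr plays the role of the slice method.

```perl
use strict;
use warnings;

# Return (leading, trailing): the largest number of leading and trailing
# gap characters found in any of the aligned strings.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($lead, $trail) = (0, 0);
    foreach my $row (@rows) {
        my ($head) = $row =~ /^(\Q$gap_char\E*)/;
        my ($tail) = $row =~ /(\Q$gap_char\E*)$/;
        $lead  = length($head) if length($head) > $lead;
        $trail = length($tail) if length($tail) > $trail;
    }
    return ($lead, $trail);
}

# Cut every row down to the columns where no row has a terminal gap.
sub trim_rows {
    my ($gap_char, @rows) = @_;
    my ($lead, $trail) = gap_margins($gap_char, @rows);
    return map { substr($_, $lead, length($_) - $lead - $trail) } @rows;
}

my @rows = ('--ACGTAC', 'TTACG---', '-TACGTA-');
print join("\n", trim_rows('-', @rows)), "\n";
```

The sketch assumes that all rows have the same length and that at least one column survives the trimming.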

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else:

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    $factory->outfile('blastout');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data1protfasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
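The selection logic of this solution (keep the first hit of each databank, using a hash of databanks already seen) can be tested on plain hit names. This is a sketch under assumptions: the hit names below are made up for the illustration, names are assumed to have the form db|accession|name, and best_by_db is a hypothetical helper.

```perl
use strict;
use warnings;

# Extract the databank prefix of a hit name such as "sp|ACC00001|NAME1_ECOLI".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Keep the first hit seen for each databank; the input is assumed to be
# sorted by significance, as in a Blast report.
sub best_by_db {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;        # this databank already has its best hit
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, for illustration only.
my @hits = ('sp|ACC00001|NAME1_ECOLI', 'sp|ACC00002|NAME2_ECOLI',
            'tr|ACC00003|NAME3_YEAST', 'sp|ACC00004|NAME4_HUMAN');
print "$_\n" for best_by_db(@hits);
```

Only the first sp hit and the first tr hit are printed.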

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
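The padding function of this solution can be written more compactly with the repetition operator x and tested alone. This is a sketch of an equivalent formulation, not the course's code; it assumes, as the loops do, that positions are counted from 1.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start
# (counted from 1) of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;                        # gaps before the sequence
    my $right = $length - $left - length($seq);    # gaps after the sequence
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

The result always has the width of the alignment as long as the sequence fits into it.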

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by: ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;      # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene, with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module must return a true value

Page 56: Bio Perl

Chapter 5 Databases

    use LocalDBAccess;
    use Bio::SeqIO;

    my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta' );

    print $out $seq;

Solution A.29

Chapter 6. Perl Reminders

These reminders may be helpful for using the modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter);

• for an output parameter:

    %blastParam = (-run        => \%runParam,
                   -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ]);    # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs);      # Bio::Seq.pm objects

where @seqs is an array variable holding the sequence.

• references on anonymous data structures:

    # (reminder) a hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
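The kinds of references listed above (to a named array, to an anonymous array, to a function) can be exercised together in a few lines of core Perl. In this sketch, total_length and the parameter names -seqs and -filt_func are assumptions chosen to mimic the bioperl calling style; they are not part of any bioperl class.

```perl
use strict;
use warnings;

# Sum the lengths of the strings in an array; takes a reference to the
# array and an optional reference to a filter function.
sub total_length {
    my ($seqs_ref, $filter_ref) = @_;
    my $total = 0;
    foreach my $s (@$seqs_ref) {
        next if $filter_ref && !$filter_ref->($s);   # call through the code reference
        $total += length $s;
    }
    return $total;
}

my @seqs   = ('ACGT', 'AC', 'ACGTACGT');
my %params = ( -seqs      => \@seqs,                        # reference to a named array
               -filt_func => sub { length($_[0]) > 2 } );   # reference to an anonymous function

print ref($params{-seqs}), " ", ref($params{-filt_func}), "\n";
print total_length($params{-seqs}, $params{-filt_func}), "\n";
print total_length([ 'AAA' ]), "\n";                        # reference to an anonymous array
```

ref returns ARRAY and CODE for the two values, since neither of them is an object.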

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. Standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;                 # read a line
    print OUTFILE $sequence;          # write a string

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');        # make a tied filehandle
    $sequence = <$in>;                               # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                            # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
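Passing a bareword filehandle to a function through a typeglob reference can be tried as follows. This is a minimal sketch: write_record is a hypothetical function, and an in-memory file (available in perl 5.8 and later) stands in for a real output file so that the result can be inspected.

```perl
use strict;
use warnings;

# Write one fasta-like record to whatever handle the caller passes in.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print $fh ">$id\n$seq\n";
}

# An in-memory file stands in for a real output file.
my $buffer = '';
open(OUTFILE, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record(\*OUTFILE, 'seq1', 'ACGT');    # pass a reference to the typeglob
close(OUTFILE);

print $buffer;
```

The same function accepts \*STDOUT, or a lexical filehandle opened with open(my $fh, ...), without any change.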

6.2.3. Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

bioperl indeed uses perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n

die $out

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
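The die/eval mechanism can be tried without bioperl. The following is a small sketch under stated assumptions: `check_sequence` and `try_sequence` are hypothetical functions standing in for a module method that throws, and the validity test is invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical validator, standing in for a module method that throws.
sub check_sequence {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n"
        unless $seq =~ /^[A-Za-z*-]+$/;
    return length $seq;
}

# Wrap the call in a block eval; $@ holds the message if die was called.
sub try_sequence {
    my ($seq) = @_;
    my $len = eval { check_sequence($seq) };
    return $@ ? "caught: $@" : "ok: $len";
}

print try_sequence("ACGT"), "\n";
print try_sequence("==");
```

The block form of eval used here still needs its terminating semicolon, because eval is an expression and not a control structure.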

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
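The abstract-method idiom can be sketched in plain Perl as follows. The classes `ShapeI`, `Square` and `Blob` are hypothetical and exist only for this illustration; bioperl itself uses throw rather than a bare die.

```perl
use strict;
use warnings;

package ShapeI;                      # hypothetical interface class
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub area {
    my ($self) = @_;
    # "abstract" method: only reached if the subclass did not override it
    die ref($self) . " did not implement area\n";
}

package Square;                      # hypothetical implementing class
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2 }

package Blob;                        # forgets to implement area
our @ISA = ('ShapeI');

package main;
my $square_area = Square->new(side => 3)->area;
print "$square_area\n";
eval { Blob->new->area };
print "error: $@";
```

The error only appears when the missing method is actually called, which is the main limitation of this way of encoding interfaces.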


6.2.4. Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
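To see what getopts leaves in the hash and in @ARGV, here is a short sketch that simulates a command line. The option values and file names are made up for the demonstration; a real script would use @ARGV as given.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (assumption for the demo).
local @ARGV = ('-i', 'embl', '-O', 'out.fasta', 'extra.txt');

my %opts;
getopts('i:o:I:O:', \%opts) or die "bad options\n";

# Options followed by ':' in the spec take a value; missing ones get defaults.
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
my $outfile = $opts{O} || 'outfile';

print "in=$inform out=$outform file=$outfile rest=@ARGV\n";
```

getopts removes the options it recognizes from @ARGV, so the remaining arguments can be read with shift afterwards.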

6.2.5. Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component


• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI"))

    if ($seq->isa("Bio::Seq::RichSeq"))

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
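The isa and ref tests above can be checked on toy classes. In this sketch, `Feature` and `Exon` are hypothetical packages, not the bioperl classes of similar names.

```perl
use strict;
use warnings;

package Feature;   # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }

package Exon;      # hypothetical subclass
our @ISA = ('Feature');

package main;
my $exon = Exon->new;
print "class of exon: ", ref($exon), "\n";
print "object is a Feature\n" if $exon->isa('Feature');
print "class is a Feature\n"  if Exon->isa('Feature');
```

Note that ref returns the class the object was blessed into, while isa also follows the @ISA chain.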

6.2.6. BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which uses it to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style since everything is encapsulated in methods.

6.3. Perl reminders for a further advanced understanding of bioperl modules

6.3.1. Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
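A minimal sketch of the three Exporter variables, assuming a hypothetical module `My::Tools` defined inline in the same file. Because an inline package has no file to load, the sketch calls import at run time instead of writing a use statement.

```perl
use strict;
use warnings;

package My::Tools;                 # hypothetical module, inlined for the demo
use Exporter 'import';
our @EXPORT      = qw(gc_count);                       # exported by default
our @EXPORT_OK   = qw(at_count);                       # imported on demand
our %EXPORT_TAGS = (all => [qw(gc_count at_count)]);   # named set of symbols

sub gc_count { my ($s) = @_; my $n = () = $s =~ /[GC]/gi; return $n }
sub at_count { my ($s) = @_; my $n = () = $s =~ /[AT]/gi; return $n }

package main;
# With a real My/Tools.pm file this would be: use My::Tools qw(:all);
My::Tools->import(':all');

my $gc = gc_count('GATTACA');
my $at = at_count('GATTACA');
print "GC=$gc AT=$at\n";
```

Symbols listed in a tag must also appear in @EXPORT or @EXPORT_OK, otherwise Exporter refuses to export them.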

6.3.2. Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3. Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
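The tie mechanism can be reproduced on a toy class. This is a sketch only: `CountdownFh` is a hypothetical class that hands out numbers instead of sequence objects, but the READLINE and PRINT hooks are triggered exactly as described above.

```perl
use strict;
use warnings;
use Symbol;

package CountdownFh;                 # hypothetical class for the demo
sub TIEHANDLE {
    my ($class, $n) = @_;
    return bless { n => $n, out => [] }, $class;
}
sub READLINE {                       # called by <$fh>
    my $self = shift;
    return $self->{n} > 0 ? $self->{n}-- : undef;
}
sub PRINT {                          # called by print $fh ...
    my $self = shift;
    push @{ $self->{out} }, @_;
    return 1;
}

package main;
my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'CountdownFh', 3;

my @seen;
while (defined(my $item = <$fh>)) { push @seen, $item }
print $fh "a", "b";
print STDOUT "read: @seen; written: @{ $obj->{out} }\n";
```

Nothing is ever read from or written to a real file here; the filehandle syntax is only an interface to the object's methods.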

Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh( -file   => $infile,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -file   => ">$outfile",
                                 -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh( -fh     => \*STDIN,
                                 -format => $inform );
    my $out = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                 -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form} ) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
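The gap-counting logic can be tested on plain strings, without bioperl. The rows below are hypothetical data, and substr stands in for the slice method of the alignment object.

```perl
use strict;
use warnings;

# Toy alignment rows (hypothetical data), all of the same length.
my @rows = ('--ACGT-A--', '-TACGTTAC-', '---CGTTA--');
my $gap  = '-';

my ($startgaps, $endgaps) = (0, 0);
foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(\Q$gap\E*)/;   # gaps at the start of the row
    my ($trail) = $row =~ /(\Q$gap\E*)$/;   # gaps at the end of the row
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no row is inside its leading or trailing gaps.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;
print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the first row, are kept: only the leading and trailing blocks are cut.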

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( ! ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank prefix from a hit name (the part before the first '|')
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db = $1 if $name =~ /(\w+)\|/;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
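The "first hit per databank" logic relies on a hash of already seen prefixes. Here is a sketch on plain strings; the hit names are hypothetical and are assumed to be sorted from best to worst, as in a Blast report.

```perl
use strict;
use warnings;

# Hypothetical hit names, already sorted from best to worst.
my @hits = ('sp|P12345|TAUD_ECOLI', 'gb|AAA1|X', 'sp|Q99999|Y',
            'pir|S1|Z', 'gb|BBB2|W');

# Databank prefix: the word before the first '|', or the whole name.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;      # skip databanks already reported
    push @best, $hit;
}
print "$_\n" for @best;
```

The post-increment returns the old count, so the first hit of each databank passes the test and every later one is skipped.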

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }


Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
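The padding done by fill_gaps can be checked in isolation. This variant is a sketch that uses the repetition operator x instead of the two loops; the sample sequence and coordinates are hypothetical.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based)
# and the padded string is $length characters long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;       # sequence already reaches the end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
```

Both versions add start - 1 gaps on the left and fill the right side up to the alignment length.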

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by: ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction:

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index:

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index:

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if ( ! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 57: Bio Perl

Chapter 6. Perl Reminders

These reminders may be helpful for using modules [use] or for a more advanced understanding [advanced] of them.

6.1. UML

Figure 6.1. UML meanings

6.2. Perl reminders to use bioperl modules

6.2.1. References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = { a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
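The three uses of references listed above (hash parameter, function parameter, output parameter) can be combined in one small sketch. The function `run_filter` and the score data are hypothetical and unrelated to Bio::Tools::Blast.

```perl
use strict;
use warnings;

# Hypothetical function taking a hash ref, a code ref and an output array ref.
sub run_filter {
    my ($params, $filter, $out) = @_;
    foreach my $key (sort keys %$params) {
        push @$out, $key if $filter->($params->{$key});
    }
    return scalar @$out;
}

my %scores = (hitA => 120, hitB => 35, hitC => 80);
my @kept;                                   # filled by run_filter
run_filter(\%scores, sub { $_[0] >= 50 }, \@kept);

print "kept: @kept\n";
print ref(\%scores), " ", ref(\@kept), " ", ref(sub {}), " ", ref([1, 2]), "\n";
```

Passing \@kept lets the function modify the caller's array, which is what the -save_array parameter relies on.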

6.2.2. Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in Perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when opening the file:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;             # read a sequence object
    print OUTFILE $sequence;      # write a sequence object
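A self-contained sketch of the open, read and print cycle, assuming a temporary file created with the core File::Temp module; the two lines written are sample data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write two lines through a filehandle, then read them back.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print $tmp_fh ">seq1\n", "ACGT\n";
close($tmp_fh) or die "cannot close $path: $!";

open(INFILE, '<', $path) || die "cannot open $path: $!";
my @lines = <INFILE>;          # list context reads every remaining line
close(INFILE);

print STDOUT "read ", scalar(@lines), " lines, first: $lines[0]";
```

With plain filehandles the unit of reading is a line; the tied filehandles built by bioperl return one sequence object per read instead.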

bull In bioperl there are classes that build filehandles

$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object

bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob

my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)

typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
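To make the typeglob remark concrete, here is a small sketch in plain Perl. `write_record` is a hypothetical helper, and the in-memory handle is only there so that the result can be inspected; neither comes from the course material.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one record to whatever handle it is given.
sub write_record {
    my ($fh, $record) = @_;
    print {$fh} "$record\n";
}

# Bareword handle: pass a reference to its typeglob.
write_record(\*STDOUT, 'to standard output');

# Lexical handle (opened here on an in-memory string): it is already
# a scalar, so it can be passed as is.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory handle: $!";
write_record($mem, 'to a lexical handle');
close($mem);

print "buffer holds: $buffer";
```

Lexical handles (open my $fh, ...) avoid the typeglob detour altogether, but the \*STDOUT form is still needed for the predefined bareword handles.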

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

Bioperl indeed uses the Perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq );
    };
    if( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
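The die/eval pairing can be exercised without bioperl. In this sketch `set_sequence` is a hypothetical function that dies on suspicious input, roughly in the way the bioperl exception shown above is raised; the validation rule is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical check that throws when the string does not look like a sequence.
sub set_sequence {
    my ($string) = @_;
    die "MSG: sequence [$string] does not look healthy\n"
        unless $string =~ /^[A-Za-z*-]+$/;
    return uc $string;
}

my @log;
foreach my $candidate ('gattaca', '==') {
    my $seq = eval { set_sequence($candidate) };   # note the ; after the block
    if ($@) {
        push @log, 'caught';
        print "Caught exception: $@";
    } else {
        push @log, "ok $seq";
        print "no exception: $seq\n";
    }
}
```

The error message passed to die is available in $@ after the eval block; $@ is empty when the block completed normally.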

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
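The same idea can be reproduced in a few lines of plain Perl. All package names below are hypothetical; the sketch only shows that a subclass which forgets to override the method inherits the version that dies.

```perl
use strict;
use warnings;

package Hypothetical::SeqI;       # the interface: declares seq() but cannot run it
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . " did not provide the seq method\n";
}

package Hypothetical::GoodSeq;    # implements the interface
our @ISA = ('Hypothetical::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package Hypothetical::LazySeq;    # forgets to implement seq()
our @ISA = ('Hypothetical::SeqI');

package main;

my $good = Hypothetical::GoodSeq->new(seq => 'ACGT');
my $lazy = Hypothetical::LazySeq->new(seq => 'ACGT');

print "good: ", $good->seq, "\n";

my $result = eval { $lazy->seq };
my $error  = $@;
print "lazy: $error" unless defined $result;
```

The error appears only when the missing method is actually called, not when the class is loaded, which is the main limitation of this way of encoding abstract methods.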

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
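As a runnable illustration of the first form, the sketch below fills @ARGV by hand so that it behaves the same wherever it is run; the option letters follow the example above, and the extra -v switch is an assumption added to show a boolean option.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate: script.pl -i embl -I seqs.embl -v
local @ARGV = ('-i', 'embl', '-I', 'seqs.embl', '-v');

my %opts = ();
# A letter followed by ':' takes a value; a bare letter is a boolean switch.
getopts('i:o:I:O:v', \%opts) or die "bad options\n";

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';
my $verbose = $opts{'v'} ? 1 : 0;

print "$inform:$infile -> $outform:$outfile (verbose=$verbose)\n";
```

Passing a hash reference as second argument keeps the options out of the global $opt_x variables used by the second form.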

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new ($seqobj);

  • As a library component.

• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
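These class operations can be tried on a toy hierarchy. The two packages below are hypothetical and only mimic the shape of the bioperl classes (a root class providing new, a subclass inheriting through @ISA).

```perl
use strict;
use warnings;

package Hypothetical::Root;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package Hypothetical::Seq;
our @ISA = qw(Hypothetical::Root);          # inheritance, declared as in Bio::Seq
sub seq_length { my ($self) = @_; return length($self->{seq}); }

package main;

my $seq = Hypothetical::Seq->new(seq => 'GATTACA');   # instantiate with new

print "class of seq: ", ref($seq), "\n";
print "the object is a Root\n" if $seq->isa('Hypothetical::Root');
print "the class is a Root\n"  if Hypothetical::Seq->isa('Hypothetical::Root');
print "length: ", $seq->seq_length, "\n";
```

Note that new is inherited: Hypothetical::Seq does not define it, yet the object is blessed into Hypothetical::Seq because the class name is passed as first argument.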

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which uses one to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything else is encapsulated in methods.
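The timing of a BEGIN block is easy to observe in plain Perl. In this sketch the variable names are hypothetical; the point is only the order in which the two push statements run.

```perl
use strict;
use warnings;

our @order;
our $default_policy;

push @order, 'run time';

BEGIN {
    # Executed as soon as the compiler has parsed this block,
    # before any ordinary statement of the file runs.
    $default_policy = 'widest';
    push @order, 'BEGIN';
}

print "order: @order\n";
print "policy: $default_policy\n";
```

Although the BEGIN block is written after the first push, its content is recorded first, which is why a module can rely on it to set defaults before any of its methods is called.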

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
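The three Exporter variables can be seen at work in a single file. Hypothetical::Tools and its functions are invented for this sketch; because the package is defined inline rather than in its own .pm file, the import is triggered by an explicit call instead of a use statement.

```perl
use strict;
use warnings;

BEGIN {
    package Hypothetical::Tools;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('revcom');                 # exported by default
    our @EXPORT_OK   = ('gc_count', '$Default');   # exported on demand
    our %EXPORT_TAGS = ( obj => ['$Default'] );    # named set of symbols
    our $Default     = 'ACGT';

    sub revcom   { my $s = reverse $_[0]; $s =~ tr/ACGT/TGCA/; return $s; }
    sub gc_count { my ($s) = @_; return ($s =~ tr/GC//); }
}

# For a module stored in its own file this would be written:
#   use Hypothetical::Tools qw(:DEFAULT gc_count :obj);
BEGIN { Hypothetical::Tools->import(qw(:DEFAULT gc_count :obj)); }

print revcom('AAGC'), " ", gc_count('AAGC'), " $Default\n";
```

Symbols named in a tag must also be listed in @EXPORT or @EXPORT_OK, otherwise Exporter refuses to export them.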

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;               # make a tied filehandle
    $sequence = <$fh>;            # read a sequence object (calls READLINE)
    print $fh $sequence;          # write a sequence object (calls PRINT)
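The mechanism can be reproduced in plain Perl with a toy class. Hypothetical::RecordIO below is an assumption made for the example: it keeps its records in memory, where Bio::SeqIO would read and write sequence files, but fh, TIEHANDLE, READLINE and PRINT play the same roles.

```perl
use strict;
use warnings;
use Symbol ();

package Hypothetical::RecordIO;

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_record  { my ($self) = @_; return shift @{ $self->{in} }; }
sub write_record { my ($self, @recs) = @_; push @{ $self->{out} }, @recs; return 1; }

# Return a filehandle tied to this class, wrapping the object.
sub fh {
    my ($self) = @_;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym();
    tie *$s, $class, $self;
    return $s;
}

sub TIEHANDLE { my ($class, $io) = @_; return bless { io => $io }, $class; }

sub READLINE {
    my ($self) = @_;
    return $self->{io}->next_record unless wantarray;
    my (@list, $rec);
    push @list, $rec while defined($rec = $self->{io}->next_record);
    return @list;
}

sub PRINT { my ($self, @recs) = @_; return $self->{io}->write_record(@recs); }

package main;

my $io = Hypothetical::RecordIO->new('seq1', 'seq2', 'seq3');
my $fh = $io->fh;

my $first = <$fh>;            # calls READLINE in scalar context
my @rest  = <$fh>;            # calls READLINE in list context
print $fh $first, @rest;      # calls PRINT

print "first=$first rest=@rest written=@{ $io->{out} }\n";
```

The wantarray test is what lets the same READLINE serve both a single read and a "read everything" in list context.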

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if ( ! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lowercasing, to avoid a warning on undef
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if ( ! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "   -i informat  -- infile format (default 'swiss')\n";
        print "   -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "   -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # works almost unchanged for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
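The column arithmetic of this solution can be checked on plain strings, without an alignment file. `trim_gapped_ends` is a hypothetical helper written for this sketch; it takes the rows of an alignment as equal-length strings instead of a Bio::SimpleAlign object.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the leading and trailing columns in which
# at least one row still starts or ends with gap characters.
sub trim_gapped_ends {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    return map { substr($_, $startgaps, $width) } @rows;
}

my @trimmed = trim_gapped_ends('-', '--ACGTAC-', 'TTACG-ACG', '-TACGTA--');
print "$_\n" for @trimmed;
```

Internal gaps, such as the one in the second row, are kept: only the end blocks are cut, which matches the slice call in the solution.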

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else, use these lines instead of the two above:
    #
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # despite its name, $max_length is used as the minimal hit length to report
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # the databank name is the word preceding the first | of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done = ();
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;    # only the first hit of each result
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start, and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
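The padding done by fill_gaps can be checked in isolation. The version below is an equivalent sketch written with the repetition operator rather than two loops; the sample values are made up for the illustration.

```perl
use strict;
use warnings;

# Same padding logic as fill_gaps in the solution, written with the x operator.
# $start is 1-based; $length is the width of the reference sequence.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 4, 10);
print "$padded\n";
```

Each HSP is thus turned into a row as wide as the contig, so that it can be added to the same alignment object.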

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;    # only the first subject of each report
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);

        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database
    #   - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;
    use Bio::PrimarySeq;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if ( ! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry:
        # subsequence of the sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # databases known to golden, with the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        # an entry is given as database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # close the pipe, then check the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 58: Bio Perl

Chapter 6 Perl Reminders

Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them

61 UML

Figure 61 UML meanings

62 Perl reminders to use bioperl modules

621 References

• to pass a hash or a function as a parameter to a function:

    $blastObj = Bio::Tools::Blast->new( -run       => \%runParam,
                                        -parse     => 1,
                                        -filt_func => \&my_filter );

• for an output parameter:

    %blastParam = ( -run        => \%runParam,
                    -save_array => \@blast_objs );

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );    # Bio::Seq.pm objects

  which is equivalent to

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );      # Bio::Seq.pm objects

  with a variable.

• references on anonymous data structures:

    # (reminder) a plain hash
    %a = (a => 1, b => 2 );
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2 };
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced value (or its class for an object, see below).
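The reference idioms listed in this section can be tried without bioperl. The following is only a sketch using core Perl: select_seqs and keep_long are hypothetical names invented for the illustration, and the sequences are plain strings rather than Bio::Seq objects.

```perl
use strict;
use warnings;

# Hypothetical callback, standing in for a filter such as my_filter.
sub keep_long { my ($s) = @_; return length($s) >= 4; }

# Hypothetical function taking a hash reference, a code reference
# and a reference to an output array.
sub select_seqs {
    my ($param, $filter, $out) = @_;
    foreach my $s ( @{ $param->{-seqs} } ) {
        push @$out, $s if $filter->($s);
    }
    return scalar @$out;
}

my @seqs  = ('ACGT', 'AC', 'ACGTAC');
my %param = ( -seqs => \@seqs );     # reference to a named array
my @kept;                            # output parameter, filled by select_seqs
my $n = select_seqs(\%param, \&keep_long, \@kept);

print "kept $n: @kept\n";
print join(' ', ref(\%param), ref(\&keep_long), ref(\@kept), ref([1, 2])), "\n";
```

The last line prints the reference types (HASH, CODE, ARRAY, ARRAY), which is what ref() reports for unblessed references.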

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors): in perl, a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $line;         # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f' );      # make a tied filehandle
    $sequence = <$in>;                             # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f' );
    print $out $sequence;                          # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl' );

  Typeglobs are in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
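To make the point about passing filehandles concrete, here is a small core-Perl sketch. The helper write_record is hypothetical, and an in-memory file (a reference to a scalar, supported since perl 5.8) stands in for a real output file so that the example leaves nothing on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: writes one record to whatever filehandle it receives.
sub write_record {
    my ($fh, $id, $seq) = @_;
    print {$fh} ">$id\n$seq\n";
}

# An in-memory file stands in for a real output file.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
write_record($out, 'seq1', 'ACGT');        # lexical filehandle
close($out);

write_record(\*STDOUT, 'seq2', 'TTGA');    # reference to the STDOUT typeglob

# Read the records back, one line per element.
open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
my @lines = <$in>;
close($in);
print "read back ", scalar(@lines), " lines\n";
```

Lexical filehandles such as $out are ordinary scalars and can be passed directly; the typeglob reference is only needed for bareword handles like STDOUT.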

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

  bioperl indeed uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
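The two uses of exceptions described in this section (catching with eval, and abstract methods that die when not overridden) can be sketched in core Perl as follows. The classes My::SeqI, My::Seq and My::Broken are hypothetical and only mimic the pattern; they use a plain die instead of the bioperl throw method.

```perl
use strict;
use warnings;

{
    package My::SeqI;       # hypothetical interface class
    sub new { my ($class) = @_; return bless {}, $class; }
    sub seq {               # "abstract" method: must be overridden
        my ($self) = @_;
        die ref($self) . " did not implement seq\n";
    }

    package My::Seq;        # hypothetical concrete class
    our @ISA = ('My::SeqI');
    sub seq { return 'ACGT'; }

    package My::Broken;     # forgets to implement seq
    our @ISA = ('My::SeqI');
}

my @messages;
foreach my $class ('My::Seq', 'My::Broken') {
    my $obj = $class->new;
    my $seq = eval { $obj->seq };     # the eval block traps the die
    if ($@) {
        push @messages, "Caught exception: $@";
    } else {
        push @messages, "no exception: $seq\n";
    }
}
print @messages;
```

Calling seq on My::Broken falls through to the interface class, whose method raises the exception; this is the same mechanism the Bio::PrimarySeqI example relies on.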

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) { $format = $opt_f; } else { $format = "Genbank"; }
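A self-contained way to see how getopts fills the options hash is to simulate a command line. In this sketch the contents of @ARGV are an assumption made for the demonstration; a real script would leave @ARGV untouched.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line, for demonstration only.
local @ARGV = ('-i', 'embl', '-h', 'seqs.embl');

my %opts;
# 'i:' and 'o:' take a value, 'h' is a simple flag.
getopts('hi:o:', \%opts) or die "bad options\n";

my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';

print "help requested\n" if $opts{h};
print "convert $inform -> $outform, remaining arguments: @ARGV\n";
```

getopts removes the options it recognizes from @ARGV, so the remaining arguments (here the file name) are left for the script to process.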

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component.

• To test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) ...

    if ($seq->isa("Bio::Seq::RichSeq")) ...

• To find the class of an object:

    print "class of exon: ", ref($exon), "\n";
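The inheritance and is-a tests above can be reproduced with two tiny classes. This is a sketch in core Perl; My::Feature and My::Exon are hypothetical classes made up for the example and are not part of bioperl.

```perl
use strict;
use warnings;

{
    package My::Feature;             # hypothetical base class
    sub new {
        my ($class, %args) = @_;
        return bless { %args }, $class;
    }
    sub start { return $_[0]->{start}; }

    package My::Exon;                # hypothetical subclass
    our @ISA = ('My::Feature');      # inheritance is declared through @ISA
}

my $exon = My::Exon->new(start => 12);

print "class of exon: ", ref($exon), "\n";
print "object is a My::Feature\n" if $exon->isa('My::Feature');
print "class is a My::Feature\n"  if My::Exon->isa('My::Feature');
print "start: ", $exon->start, "\n";      # method inherited from My::Feature
```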

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
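The timing of a BEGIN block is easy to observe. In this sketch the variable names and the 'widest' value are assumptions chosen for the illustration; the point is only that the block runs at compile time, before any ordinary statement of the file.

```perl
use strict;
use warnings;

our @trace;
our $policy;

push @trace, 'main code';          # ordinary statement, runs at run time

BEGIN {
    # Runs as soon as the parser has read this block, that is,
    # before the push above is executed.
    $policy = 'widest';            # hypothetical default setting
    push @trace, 'BEGIN block';
}

print join(' -> ', @trace), "\n";
print "policy: $policy\n";
```

Although the BEGIN block comes later in the file, its entry appears first in the trace.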

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj; here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
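The three Exporter variables can be exercised in a single file. The module My::Tools below is hypothetical, as are its symbols; because it is defined inline rather than in its own file, the import is called explicitly in a BEGIN block, which is roughly what use My::Tools qw(:obj gc_count); would do for a module on disk.

```perl
use strict;
use warnings;

{
    package My::Tools;                              # hypothetical module
    use Exporter ();
    our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS);
    BEGIN {
        @ISA         = ('Exporter');
        @EXPORT      = ('revcomp');                 # exported by default
        @EXPORT_OK   = ('gc_count', '$Default');    # imported on demand
        %EXPORT_TAGS = ( obj => ['$Default'] );     # a named set of symbols
    }
    our $Default = 'ACGT';
    sub revcomp  { my $s = reverse $_[0]; $s =~ tr/ACGT/TGCA/; return $s; }
    sub gc_count { my ($s) = @_; return ($s =~ tr/GC//); }
}

# Equivalent, for an inline package, of: use My::Tools qw(:obj gc_count);
BEGIN { My::Tools->import(':obj', 'gc_count'); }

print "default: $Default\n";
print "GC count: ", gc_count('GGCAT'), "\n";
# An explicit import list replaces the default list, so revcomp
# was not imported and needs its full name here.
print "revcomp: ", My::Tools::revcomp('AACG'), "\n";
```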

6.3.2 Compiler instructions

• use strict; no symbolic references, no use of undeclared variables (local variables are declared by a my statement; globals must be pre-declared or fully qualified), no barewords.

• use vars qw(@ISA %valid_type); a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
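The tie mechanism can be tried without bioperl. The class My::RecordFh below is a hypothetical stand-in for a tied-handle class: it serves records from an in-memory list on reading and stores whatever is printed. It is a sketch of the general technique, not the Bio::SeqIO implementation.

```perl
use strict;
use warnings;
use Symbol ();

{
    package My::RecordFh;               # hypothetical tied-handle class
    sub TIEHANDLE {
        my ($class, @records) = @_;
        return bless { in => [@records], out => [] }, $class;
    }
    sub READLINE {                      # called by <$fh>
        my $self = shift;
        return shift @{ $self->{in} } unless wantarray;
        my @all = @{ $self->{in} };     # list context: hand out everything left
        $self->{in} = [];
        return @all;
    }
    sub PRINT {                         # called by print $fh ...
        my $self = shift;
        push @{ $self->{out} }, @_;
        return 1;
    }
}

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'My::RecordFh', 'seq1', 'seq2', 'seq3';

my $first = <$fh>;              # scalar context: one record
my @rest  = <$fh>;              # list context: all remaining records
print $fh "written:$first";     # goes to PRINT, not to the terminal

print "first=$first rest=@rest stored=@{ $obj->{out} }\n";
```

As in the READLINE method shown above, wantarray is what lets a single method serve both the scalar and the list form of the read operator.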


Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile ) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
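The gap-trimming logic of this solution can be checked on plain strings, without an alignment object. The three rows below are made-up data, and substr plays the role that the slice method plays for the alignment; this is a sketch of the idea only.

```perl
use strict;
use warnings;

# Hypothetical aligned rows of equal length, '-' being the gap character.
my @rows = ('--ACGTAC-', '-TACGTACG', '---CGTA--');
my $gap  = '-';

my ($startgaps, $endgaps) = (0, 0);
foreach my $str (@rows) {
    my ($lead)  = $str =~ /^(\Q$gap\E*)/;    # run of gaps at the start
    my ($trail) = $str =~ /(\Q$gap\E*)$/;    # run of gaps at the end
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no row is still inside its leading
# or trailing gap run.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;
print "$_\n" for @trimmed;
```

Taking the maximum over all rows guarantees that every row has a residue in the first and last remaining column.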

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank name from a hit name such as db|accession|id
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            # only the best hit of each query is displayed
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
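The padding performed by fill_gaps can be tested on its own. The variant below, pad_with_gaps, is a hypothetical rewrite using the repetition operator instead of the two loops; it follows the same rule (a 1-based start, and an alignment width given as $length) and additionally clamps the right-hand padding at zero, which is an assumption about how an overlong fragment should be handled.

```perl
use strict;
use warnings;

# Pad a fragment with gaps so that it starts at column $start (1-based)
# of an alignment that is $length columns wide.
sub pad_with_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;     # fragment runs past the alignment end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = pad_with_gaps('ACG', 3, 8);
print "$row\n";                   # prints --ACG---
print length($row), "\n";         # prints 8
```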

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene, with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon coordinates to the subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

Page 59: Bio Perl

Chapter 6 Perl Reminders

• to pass a filehandle as a parameter:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

• references on an anonymous array:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => [ $seq ] );   # Bio::Seq.pm objects

which is equivalent to:

    %runParam = ( -method   => 'remote',
                  -prog     => 'blastp',
                  -database => 'swissprot',
                  -seqs     => \@seqs );     # Bio::Seq.pm objects

with a variable.

• references on anonymous data structures:

    # (reminder) hash
    %a = (a => 1, b => 2);
    print "hash: ", ref(\%a), " ", $a{a}, " ", $a{b}, "\n";

    # reference on an anonymous hash
    $ra = {a => 1, b => 2};
    print "ref hash: ", ref($ra), " ", $ra->{a}, " ", $ra->{b}, "\n";

    # reference on an anonymous array
    $rl = [1, 2];
    print "ref array: ", ref($rl), " ", $rl->[0], " ", $rl->[1], "\n";

See for instance exercises 2.4 and 2.7. ref() returns the type of the referenced data (or its class, see below).
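The reference forms above can be tried without bioperl. The following is a minimal sketch in plain Perl; the describe helper and the variable names are illustrative assumptions, not part of the course material.

```perl
use strict;
use warnings;

# Hypothetical helper: report what kind of thing a value refers to.
sub describe {
    my ($thing) = @_;
    my $type = ref($thing);          # '' for a non-reference
    return $type eq '' ? 'not a reference' : $type;
}

my @seqs   = ('ACGT', 'TTGA');
my %params = (
    -prog => 'blastp',
    -seqs => \@seqs,                 # reference to a named array
    -opts => [ 'a', 'b' ],           # reference to an anonymous array
);

my $ra = { a => 1, b => 2 };         # reference to an anonymous hash

print describe(\@seqs), "\n";
print describe($ra), "\n";
print describe('plain'), "\n";
print $params{-seqs}->[1], "\n";     # second element of @seqs
print $ra->{b}, "\n";
```

Passing \@seqs shares the caller's array, whereas [ ... ] builds a fresh anonymous one; either form is accepted wherever an array reference is expected.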

6.2.2 Filehandles and streams

• A data stream is, for instance, an open file. Streams are available through filehandles (file descriptors); in Perl, a filehandle is a symbol without a $, such as INFILE or OUTFILE. The standard input and standard output streams are associated by default with the STDIN and STDOUT descriptors. The filehandle is associated with the stream when the file is opened:

    open(INFILE, $infile) || die "cannot open $infile: $!";
    open(OUTFILE, "> $outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;            # read a line
    print OUTFILE $line;         # write a line

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh(-file => 'f');       # make a tied filehandle
    $sequence = <$in>;                           # read a sequence object
    $out = Bio::SeqIO->newFh(-file => '> f');
    print $out $sequence;                        # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

Typeglobs live in a general namespace on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
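Passing a handle to a subroutine can be sketched in plain Perl as follows. The write_records helper is hypothetical; the point is only that a bareword handle travels as \*STDOUT while a lexical handle is already a scalar.

```perl
use strict;
use warnings;

# Hypothetical helper: write each record to whatever handle is passed in.
sub write_records {
    my ($fh, @records) = @_;
    print {$fh} "$_\n" for @records;
    return scalar @records;
}

# A bareword handle such as STDOUT has to be passed as a typeglob reference.
write_records(\*STDOUT, 'first', 'second');

# A lexical handle (here an in-memory file) is already a scalar and can be
# passed directly.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_records($mem, 'third');
close($mem);
print "buffer holds: $buffer";
```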

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

Bioperl indeed uses Perl's exception mechanisms (die and eval) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception\n";
    } else {
        print "no exception\n";
    }
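The die/eval pairing can be exercised without bioperl. This is a minimal sketch; set_sequence and try_sequence are hypothetical stand-ins for a method that validates its argument and for the calling code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bioperl method that validates its argument.
sub set_sequence {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n"
        unless $seq =~ /^[A-Za-z*-]+$/;
    return uc $seq;
}

# Call set_sequence inside eval and return the error message,
# or an empty string when no exception was raised.
sub try_sequence {
    my ($seq) = @_;
    my $result = eval { set_sequence($seq) };
    return $@ ? $@ : '';
}

for my $candidate ('acgt', '==') {
    my $error = try_sequence($candidate);
    print $error ? "Caught exception: $error" : "no exception for $candidate\n";
}
```

Because $@ is a global that later code can overwrite, it is usually safest to test it immediately after the eval block, as done here.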

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means of obliging the subclass to implement a seq method. It is thus a way to encode abstract methods.
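The abstract-method idiom can be sketched with three small classes. All class names below (My::SeqI, My::GoodSeq, My::LazySeq) are made up for illustration and do not exist in bioperl.

```perl
use strict;
use warnings;

package My::SeqI;            # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub seq {
    my ($self) = @_;
    die ref($self) . ": implementing class did not provide seq\n";
}

package My::GoodSeq;         # implements the interface
our @ISA = ('My::SeqI');
sub seq { my ($self) = @_; return $self->{seq}; }

package My::LazySeq;         # forgets to implement seq
our @ISA = ('My::SeqI');

package main;

for my $class ('My::GoodSeq', 'My::LazySeq') {
    my $obj = $class->new(seq => 'ACGT');
    my $value = eval { $obj->seq() };
    print defined $value ? "$class: $value\n" : "$class failed: $@";
}
```

The error only appears when the missing method is actually called, so this is a run-time check rather than a compile-time one.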

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
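A self-contained way to try getopts is to localize @ARGV inside a helper, as in the sketch below. The parse_options function and its returned keys are assumptions made for the example.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse a list of arguments and return the settings.
sub parse_options {
    my (@args) = @_;
    local @ARGV = @args;             # getopts() reads @ARGV
    my %opts;
    getopts('i:o:h', \%opts) or die "bad options\n";
    return {
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        help    => $opts{h} ? 1 : 0,
        rest    => [@ARGV],          # what is left after the options
    };
}

my $settings = parse_options('-o', 'gcg', 'seqs.sp');
print "in=$settings->{inform} out=$settings->{outform} files=@{ $settings->{rest} }\n";
```

A letter followed by a colon (i:, o:) takes a value; a bare letter (h) is a boolean flag.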

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a given class:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
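The inheritance tests above can be reproduced with two toy classes. My::Root and My::Seq are hypothetical names chosen for this sketch.

```perl
use strict;
use warnings;

package My::Root;                       # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package My::Seq;                        # hypothetical subclass
our @ISA = ('My::Root');

package main;

my $seq = My::Seq->new();
print "class of seq: ", ref($seq), "\n";
print "seq is-a My::Root\n"      if $seq->isa('My::Root');
print "My::Seq is-a My::Root\n"  if My::Seq->isa('My::Root');
print "seq is not a My::Other\n" unless $seq->isa('My::Other');
```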

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything else is encapsulated in methods.
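The timing of a BEGIN block can be observed with a short plain-Perl sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

our @events;                       # records the order in which code runs
our $default_policy;

push @events, 'main program';      # runs at run time, after every BEGIN block

BEGIN {
    # Runs as soon as the parser has finished reading this block.
    $default_policy = 'widest';
    push @events, 'BEGIN block';
}

print join(' -> ', @events), "\n";
print "policy: $default_policy\n";
```

Although the BEGIN block appears after the first push in the file, it is executed first, during compilation.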

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl):

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
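The three Exporter variables can be seen working together in the single-file sketch below. My::Util and its symbols are hypothetical; the module is defined inline in a BEGIN block only so that the example needs no separate file.

```perl
use strict;
use warnings;

# Hypothetical module, defined inline so that the sketch is a single file.
# In a real project it would live in My/Util.pm and be loaded with "use".
BEGIN {
    package My::Util;
    use Exporter ();
    our @ISA         = ('Exporter');
    our @EXPORT      = ('greet');                 # exported by default
    our @EXPORT_OK   = ('shout', '$Default');     # exported on demand
    our %EXPORT_TAGS = (obj => ['$Default']);     # named set of symbols
    our $Default     = 'bioperl';
    sub greet { return "hello $_[0]" }
    sub shout { return uc $_[0] }
}

# Equivalent of: use My::Util qw(:DEFAULT shout :obj);
BEGIN { My::Util->import(':DEFAULT', 'shout', ':obj') }

our $Default;    # declare the imported name in this package as well
print greet($Default), "\n";
print shout('done'), "\n";
```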

6.3.2 Compiler instructions

• use strict: no symbolic references, no access to undeclared non-local variables (local variables are declared with a my statement), no barewords.

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
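The same mechanism can be tried on a toy class. My::RecordIO below is a hypothetical stand-in for Bio::SeqIO that serves records from an in-memory list; its method names are assumptions made for this sketch.

```perl
use strict;
use warnings;
use Symbol ();

package My::RecordIO;        # hypothetical class; stands in for Bio::SeqIO

sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_rec  { my ($self) = @_; return shift @{ $self->{in} }; }
sub write_rec { my ($self, @recs) = @_; push @{ $self->{out} }, @recs; return 1; }

# Build a filehandle tied to this object.
sub fh {
    my ($self) = @_;
    my $sym = Symbol::gensym();
    tie *$sym, ref($self), $self;
    return $sym;
}

# tie() calls TIEHANDLE; here it simply hands back the existing object.
sub TIEHANDLE { my ($class, $self) = @_; return $self; }

sub READLINE {
    my ($self) = @_;
    return $self->next_rec() unless wantarray;
    my @all;
    while (defined(my $rec = $self->next_rec())) { push @all, $rec; }
    return @all;
}

sub PRINT { my ($self, @recs) = @_; return $self->write_rec(@recs); }

package main;

my $io = My::RecordIO->new('seq1', 'seq2', 'seq3');
my $fh = $io->fh();
my $first = <$fh>;            # calls READLINE in scalar context
my @rest  = <$fh>;            # calls READLINE in list context
print $fh $first, @rest;      # calls PRINT
print "read $first, then @rest; wrote @{ $io->{out} }\n";
```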


Appendix A Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= "." . $outform;
    }

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    # apply the default before lc() so that a missing option raises no warning
    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function -- exo
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split("\n", $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function\n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);
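The column arithmetic of this solution can be checked on plain strings, without bioperl. In the sketch below, ungapped_range is a hypothetical helper and the three rows are made-up data; it returns the same 1-based bounds that the solution passes to slice.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based first and last columns that are
# free of leading and trailing gaps in every row of the alignment.
sub ungapped_range {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^((?:\Q$gap_char\E)*)/;
        my ($trail) = $row =~ /((?:\Q$gap_char\E)*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

my @rows = ('--ACGTAC-', 'TTACGTACG', '-TACGT---');
my ($from, $to) = ungapped_range('-', @rows);
print "keep columns $from to $to\n";
print substr($_, $from - 1, $to - $from + 1), "\n" for @rows;
```

Unlike the solution, the sketch uses the gap character for the trailing gaps as well, rather than a literal "-".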

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use this instead of the two statements above)
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                    '-data'     => 'swissprot',
                                                    _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (! ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # the databank is the word preceding the first "|" in the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my %done;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_" . $i;
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@{$hitarray_ref}) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @{$oldhitarray_ref}) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @{$oldhitarray_ref});
        push(@previous_hitarray, @{$hitarray_ref});
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                              -id    => $query1->display_id,
                                              -start => 1,
                                              -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                              -id    => $report->sbjctName,
                                              -start => 1,
                                              -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # an entry is written database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;   # a module file must end with a true value

85

Appendix A Solutions

86


Chapter 6 Perl Reminders

    open (INFILE, $infile) || die "cannot open $infile: $!";
    open (OUTFILE, ">$outfile") || die "cannot open $outfile: $!";
    $line = <INFILE>;               # read a sequence object
    print OUTFILE $sequence;        # write a sequence object

• In bioperl, there are classes that build filehandles:

    $in = Bio::SeqIO->newFh ( -file => 'f');      # make a tied filehandle
    $sequence = <$in>;                            # read a sequence object
    $out = Bio::SeqIO->newFh ( -file => '> f');
    print $out $sequence;                         # write a sequence object

• Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. For this, you have to pass a reference to the typeglob:

    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

Typeglobs are in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the Advanced Perl Programming book).
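A minimal sketch of passing a filehandle by typeglob reference, using core Perl only. The write_line helper and the in-memory buffer are illustrative assumptions and are not part of bioperl.

```perl
use strict;
use warnings;

# write_line is a hypothetical helper: it accepts a reference to a
# filehandle, so the caller decides where the output goes.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} "$text\n";
}

# A bareword handle such as STDOUT is passed as a typeglob reference.
write_line(\*STDOUT, "to standard output");

# An in-memory filehandle (assumption: perl 5.8 or later) makes the
# result easy to inspect.
my $buffer = '';
open(my $mem, '>', \$buffer) or die "cannot open in-memory file: $!";
write_line($mem, "to a buffer");
close($mem);
print "buffer holds: $buffer";
```

The same \*STDOUT form is what the -fh parameter of newFh receives in the example above.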

6.2.3 Exceptions

• When using a bioperl module, the following can happen:

    -------------------- EXCEPTION --------------------
    MSG: Attempting to set the sequence to [==] which does not look healthy
    STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
    STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
    STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
    STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:130
    STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
    STACK toplevel solution/blast_features.pl:22
    -------------------------------------------

Bioperl uses Perl's exception mechanisms (die and eval {}) to handle usage errors. An exception is raised by the Perl module and can be caught.

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class.

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ( $@ ) {
        print "Caught exception";
    } else {
        print "no exception";
    }
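The die/eval mechanism can be tried without bioperl. The following is a minimal sketch in core Perl; checked_sqrt is a hypothetical function introduced only for the demonstration.

```perl
use strict;
use warnings;

# checked_sqrt is a hypothetical function that raises an exception with die.
sub checked_sqrt {
    my ($x) = @_;
    die "negative input: $x\n" if $x < 0;
    return sqrt($x);
}

# eval returns undef when the block dies; the message is left in $@.
my $result = eval { checked_sqrt(-4) };
my $error  = $@;

if ($error) {
    print "Caught exception: $error";
} else {
    print "no exception, result = $result\n";
}
```

Copying $@ into a lexical variable right after the eval block is a common precaution, since later code may reset $@.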

• In bioperl, exceptions are also used to implement interface classes. The following example, from Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if( $self->can('throw') ) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
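The abstract-method idiom can be sketched with plain Perl classes. ShapeI, Square and Blob are hypothetical names chosen for the illustration; they do not exist in bioperl.

```perl
use strict;
use warnings;

package ShapeI;          # hypothetical interface class
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub area {
    # "abstract" method: only reached when a subclass did not override it
    my ($self) = @_;
    die ref($self) . " did not provide an area method\n";
}

package Square;          # implements the interface
our @ISA = ('ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package Blob;            # forgets to implement area
our @ISA = ('ShapeI');

package main;

my $area = Square->new(side => 3)->area();
my $ok   = eval { Blob->new()->area(); 1 };
my $msg  = $ok ? '' : $@;

print "Square area: $area\n";
print "Blob: $msg";
```

The error only appears when the missing method is called, not when the class is loaded, which is the main limitation of this way of encoding interfaces.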

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
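A short sketch of how getopts treats option letters with and without a trailing colon. The command line is simulated by assigning to @ARGV, and the -v switch is an assumption added for the demonstration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (assumption for the demonstration):
#   script.pl -i embl -v seqs.embl
@ARGV = ('-i', 'embl', '-v', 'seqs.embl');

my %opts;
# 'i:' and 'o:' expect a value, 'v' is a boolean switch
getopts('i:o:v', \%opts)
    or die "usage: $0 [-i informat] [-o outformat] [-v] file\n";

my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';
my $verbose = $opts{'v'} ? 1 : 0;

# getopts removes the options it has parsed; the file name stays in @ARGV
print "in=$inform out=$outform verbose=$verbose rest=@ARGV\n";
```

Without the colon after a letter, the corresponding hash entry only holds a true value, so a file name or format given after the option would be left in @ARGV.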

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
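The isa and ref tests can be tried on two small classes. Feature and Exon below are hypothetical classes defined for the example, not the bioperl ones.

```perl
use strict;
use warnings;

package Feature;         # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class; }

package Exon;            # hypothetical subclass
our @ISA = ('Feature');

package main;

my $exon = Exon->new();

my $class_isa  = Exon->isa('Feature')      ? 1 : 0;   # test on a class name
my $object_isa = $exon->isa('Feature')     ? 1 : 0;   # test on an object
my $unrelated  = $exon->isa('Hash::Thing') ? 1 : 0;   # a class it does not inherit from
my $class_name = ref($exon);                          # class the object was blessed into

print "class of exon: $class_name\n";
print "is-a Feature: $object_isa, is-a Hash::Thing: $unrelated\n";
```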

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, to define the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
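A small sketch showing when a BEGIN block runs relative to ordinary statements. The @order array is only there to record the sequence of events.

```perl
use strict;
use warnings;

our @order;

# This statement appears first in the file but only runs at run time.
push @order, 'run time';

# The BEGIN block runs as soon as it has been compiled.
BEGIN { push @order, 'compile time'; }

print join(' -> ', @order), "\n";    # prints "compile time -> run time"
```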

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in Perl).

  • @EXPORT = qw(...): symbols exported by default

  • @EXPORT_OK = qw(...): symbols imported on demand

  • %EXPORT_TAGS = (tag => [...]): tags for sets of symbols. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
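The three Exporter variables can be seen at work in a single file. SeqUtils and its two functions are hypothetical; in a real module file the import call below would be written use SeqUtils qw(:all);.

```perl
use strict;
use warnings;

package SeqUtils;        # hypothetical module, kept in the same file
use Exporter 'import';   # assumption: Exporter 5.57 or later
our @EXPORT      = qw(revcomp);                        # exported by default
our @EXPORT_OK   = qw(gc_count);                       # exported on demand
our %EXPORT_TAGS = (all => [qw(revcomp gc_count)]);    # named set of symbols

sub revcomp {
    my ($seq) = @_;
    $seq = reverse $seq;
    $seq =~ tr/ACGTacgt/TGCAtgca/;
    return $seq;
}

sub gc_count {
    my ($seq) = @_;
    my $n = ($seq =~ tr/GCgc//);    # tr returns the number of matched characters
    return $n;
}

package main;

# Run-time equivalent of "use SeqUtils qw(:all);" for a package
# that lives in the current file.
SeqUtils->import(':all');

my $rc = revcomp('AACG');
my $gc = gc_count('AACG');
print "revcomp: $rc, GC count: $gc\n";
```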

6.3.2 Compiler instructions

• use strict: no symbolic references, no use of undeclared variables (local variables are declared by a my statement), no barewords

• use vars qw(@ISA %valid_type): a pragma to pre-declare global variables

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;           # make a tied filehandle
    $sequence = <$fh>;        # read a sequence object (calls READLINE)
    print $fh $sequence;      # write a sequence object (calls PRINT)
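The tie mechanism can be reproduced in a few lines of core Perl. ListHandle is a hypothetical class written for this sketch; it hands out items from a list instead of sequence objects.

```perl
use strict;
use warnings;
use Symbol;

package ListHandle;      # hypothetical class, not part of bioperl

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { queue => [@items], written => [] }, $class;
}

# Called by <$fh>: one item in scalar context, all remaining items in list context
sub READLINE {
    my $self = shift;
    return shift @{ $self->{queue} } unless wantarray;
    my @rest = @{ $self->{queue} };
    @{ $self->{queue} } = ();
    return @rest;
}

# Called by print $fh ...
sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'ListHandle', 'seq1', 'seq2';

my $first = <$fh>;          # calls READLINE
print $fh 'seq3';           # calls PRINT
print STDOUT "read $first, wrote $obj->{written}[0]\n";
```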

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile ) {
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -file => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w

    # exercise 1.1: UnivConvert with option and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,
                                  -format => $inform );

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage ();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`; chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function -- exercise
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
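The gap-trimming logic of this solution can be checked on plain strings, without Bio::AlignIO. The three rows below are made-up data, and substr stands in for the slice method.

```perl
use strict;
use warnings;

# Toy alignment rows (hypothetical data), all of the same length
my @rows = ('--ACGTTA--',
            '-TACGTTAC-',
            '---CGTTACG');

my $gap_char = '-';
my ($startgaps, $endgaps) = (0, 0);

foreach my $row (@rows) {
    my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;    # leading run of gaps
    my ($trail) = $row =~ /(\Q$gap_char\E*)$/;    # trailing run of gaps
    $startgaps = length($lead)  if length($lead)  > $startgaps;
    $endgaps   = length($trail) if length($trail) > $endgaps;
}

# Keep only the columns where no row starts or ends with a gap
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "start gaps: $startgaps, end gaps: $endgaps\n";
print "$_\n" for @trimmed;
```

Here the longest leading run is 3 and the longest trailing run is 2, so every row is reduced to the same five columns.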

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                                 '-prog' => 'blastp',
                                 '-data' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
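The "first hit per databank" logic of this solution can be tested on a list of names alone. The hit names below are hypothetical, and the list is assumed to be sorted from best to worst, as in a Blast report.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|ID1|NAME1"
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Hypothetical hit names, assumed to be sorted from best to worst
my @hits = ('sp|ID1|NAME1', 'tr|ID2|NAME2', 'sp|ID3|NAME3', 'pdb|ID4|A');

my (%done, @best);
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;     # a better hit from this databank was already kept
    push @best, $hit;
}

print "$_\n" for @best;
```

Unlike the solution above, this variant returns 'unknown' when the name contains no databank prefix, which avoids reusing a stale $1.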

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                                 'program'  => 'blastp',
                                 'database' => 'swissprot',
                                 _READMETHOD => "Blast"
                                 );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                          -seq   => $seq->seq,
                          -id    => $seq->display_id,
                          -start => 1,
                          -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                  -seq   => $seq,
                                  -id    => $name,
                                  -start => $start,
                                  -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
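The padding done by fill_gaps can be checked in isolation. The sketch below is a variant written with the x repetition operator instead of loops; pad_with_gaps is a hypothetical name, and the input values are made up.

```perl
use strict;
use warnings;

# pad_with_gaps (hypothetical name): place $seq at 1-based position $start
# inside a row of $length columns, filling the rest with gap characters.
sub pad_with_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;      # the sequence reaches the end of the row
    return ('-' x $lead) . $seq . ('-' x $trail);
}

my $row = pad_with_gaps('ACGT', 3, 10);
print "$row\n";                    # prints "--ACGT----"
print length($row), " columns\n";
```

Every padded row has the same length as the contig, which is what allows the HSPs to be added to the same alignment object.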

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray = ();
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
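The comparison between two rounds amounts to two set differences. The following sketch computes them with hash lookups rather than grep over a growing array; the hit names and the two-round setup are made-up assumptions.

```perl
use strict;
use warnings;

# Hypothetical hit names for two successive PSI-BLAST rounds
my @round1 = qw(hitA hitB hitC);
my @round2 = qw(hitB hitC hitD);

# Hashes give constant-time membership tests
my %in_round1 = map { $_ => 1 } @round1;
my %in_round2 = map { $_ => 1 } @round2;

my @new         = grep { !$in_round1{$_} } @round2;   # in round 2 only
my @disappeared = grep { !$in_round2{$_} } @round1;   # in round 1 only

print "NEW hit name: $_\n"         for @new;
print "DISAPPEARED hit name: $_\n" for @disappeared;
```

Exact hash lookups also avoid the partial matches that a regular expression test can produce when one hit name is a prefix of another.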

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                              -seq   => $hsp->qs,
                              -id    => $query1->display_id,
                              -start => 1,
                              -end   => $query1->length
                              );
        my $sbjctSeq = Bio::LocatableSeq->new(
                              -seq   => $hsp->ss,
                              -id    => $report->sbjctName,
                              -start => 1,
                              -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;

85

Appendix A Solutions

86

Chapter 6 Perl Reminders

• throw: in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Root/RootI.html] class:

    sub throw {
        my ($self, $string) = @_;
        my $std = $self->stack_trace_dump();
        my $out = "-------------------- EXCEPTION --------------------\n" .
                  "MSG: " . $string . "\n" . $std .
                  "-------------------------------------------\n";
        die $out;
    }

• You can catch exceptions with eval (don't forget the ; at the end of the eval block):

    eval {
        # this instruction may throw an exception if parameters are incorrect
        $report = $factory->bl2seq($querySeq, $subjectSeq);
    };
    if ($@) {
        print "Caught exception";
    } else {
        print "no exception";
    }

• In bioperl, exceptions are also used to implement interface classes. The following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:

    sub seq {
        my ($self) = @_;
        if ($self->can('throw')) {
            $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        } else {
            confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
        }
    }

  is a means to oblige the subclass to implement a seq method. It is thus a way to encode abstract methods.
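The two ideas above (an interface method that only raises an exception, and eval to catch it) can be tried without bioperl. The following is a minimal sketch in core Perl; the classes ShapeI, Square and Blob and the helper try_area are hypothetical names invented for this illustration, and plain die stands in for the bioperl throw method.

```perl
use strict;
use warnings;

# Hypothetical interface class: area() only raises an exception.
package ShapeI;
sub new  { my ($class) = @_; return bless {}, $class; }
sub area { my ($self) = @_; die ref($self) . " did not implement area\n"; }

# Subclass that provides the method.
package Square;
our @ISA = ('ShapeI');
sub area { return 4; }

# Subclass that forgets to provide it.
package Blob;
our @ISA = ('ShapeI');

package main;

# Call area() inside eval and report either the result or the exception.
sub try_area {
    my ($obj) = @_;
    my $area = eval { $obj->area() };   # note the ; after the eval block
    return $@ ? "error: $@" : "area: $area\n";
}

print try_area(Square->new());
print try_area(Blob->new());
```

The second call reports the error instead of aborting the program, which is the behaviour the eval example above relies on.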

6.2.4 Getopt::Std

• To handle command line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');    # -f takes a value
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
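To experiment with getopts without typing command lines, the argument list can be set from inside the script. This is a small sketch using only the core Getopt::Std module; the wrapper parse_args and its returned keys are assumptions made for the illustration, not part of the course scripts.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse a hypothetical argument list and fall back to default formats.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;              # getopts() reads from @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        rest    => [@ARGV],           # what is left after option parsing
    };
}

my $conf = parse_args('-o', 'gcg', 'seqs.sp');
print "$conf->{inform} -> $conf->{outform}, files: @{ $conf->{rest} }\n";
```

A colon after a letter in the option string means that the option takes a value; a letter without a colon is a simple flag.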

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

  • By instantiating a class, most often with a new method (this is a coding convention - see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

        $seq_stats = Bio::Tools::SeqStats->new($seqobj);

  • As a library component.

• Test whether a class or an object is-a:

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
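The isa and ref tests can be checked on a toy hierarchy. The sketch below uses two hypothetical classes, Animal and Dog, defined only for this example; isa is the standard method every Perl class inherits from UNIVERSAL.

```perl
use strict;
use warnings;

package Animal;                       # hypothetical base class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }

package Dog;                          # hypothetical subclass
our @ISA = ('Animal');

package main;

my $dog = Dog->new(name => 'Rex');

print "class of dog: ", ref($dog), "\n";
print "Dog is-a Animal\n"     if Dog->isa('Animal');     # test on a class
print "\$dog is-a Animal\n"   if $dog->isa('Animal');    # test on an object
print "Animal is not a Dog\n" unless Animal->isa('Dog');
```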

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module; statement). See for instance Bio::LocationI, which defines the default coordinate policy:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
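The timing of a BEGIN block can be observed directly. In this sketch the variable names $policy and @order are made up for the illustration; the point is only that the BEGIN block runs at compile time, before the statement written above it.

```perl
use strict;
use warnings;

our @order;      # records the order in which the statements run
our $policy;     # package variable initialised at compile time

# Written first, but executed at run time, after the BEGIN block below.
push @order, "run time: policy is " . (defined $policy ? $policy : 'undef');

BEGIN {
    # Executed as soon as this block has been compiled.
    $policy = 'widest';
    push @order, 'BEGIN block';
}

print "$_\n" for @order;
```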

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles module variables scope (even though there are no private variables in perl):

  • @EXPORT = qw(...): symbols exported by default
  • @EXPORT_OK = qw(...): symbols imported on demand
  • %EXPORT_TAGS = (tag => [...]): tags for symbol sets. Example:

        use Bio::Tools::Blast qw(:obj);

    imports the symbols tagged as obj - here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
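The three Exporter variables can be tried in a single file. This is a sketch under assumptions: the module My::Util, its functions and its $Greeting variable are hypothetical, and the module is declared inside a BEGIN block only so that it exists before the import, which a real use statement on a separate file would do for you.

```perl
use strict;
use warnings;

BEGIN {
    package My::Util;                             # hypothetical module
    require Exporter;
    our @ISA         = ('Exporter');
    our @EXPORT      = qw(hello);                 # exported by default
    our @EXPORT_OK   = qw(bye $Greeting);         # imported on demand
    our %EXPORT_TAGS = (obj => [qw($Greeting)]);  # a named set of symbols
    our $Greeting    = 'bonjour';
    sub hello { return 'hello' }
    sub bye   { return 'bye' }
}

# With My::Util in its own file this would read: use My::Util qw(:DEFAULT :obj);
BEGIN { My::Util->import(qw(:DEFAULT :obj)); }

print hello(), " / $Greeting\n";
print main->can('bye') ? "bye imported\n" : "bye not imported\n";
```

Only hello (default) and $Greeting (through the obj tag) reach the caller; bye stays in My::Util because it was not requested.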

6.3.2 Compiler instructions

• use strict; : no symbolic references, no access to undeclared variables (local variables are declared by a my statement), no barewords.

• use vars qw(@ISA %valid_type); : a pragma to pre-declare global variables.

6.3.3 Tie

Associates a class to a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self = shift;
        my $class = ref($self) || $self;
        my $s = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;         # make a tied filehandle
    $sequence = <$fh>;      # read a sequence object (calls READLINE)
    print $fh $sequence;    # write a sequence object (calls PRINT)
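The same mechanism can be reproduced with a few lines of core Perl. In this sketch the class UpperFh is hypothetical: it serves items from an in-memory queue on reads and stores an upper-cased copy of whatever is printed, in place of the next_seq and write_seq calls of Bio::SeqIO.

```perl
use strict;
use warnings;
use Symbol;

package UpperFh;                          # hypothetical tie class

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { queue => [@items], written => [] }, $class;
}

sub READLINE {
    my $self = shift;
    return shift @{ $self->{queue} } unless wantarray;   # scalar context: one item
    my @all = @{ $self->{queue} };                        # list context: all items
    $self->{queue} = [];
    return @all;
}

sub PRINT {
    my $self = shift;
    push @{ $self->{written} }, map { uc($_) } @_;
    return 1;
}

package main;

my $fh  = Symbol::gensym();
my $obj = tie *$fh, 'UpperFh', 'first', 'second';

my $item = <$fh>;        # calls READLINE
print $fh $item;         # calls PRINT
print "read '$item', stored '$obj->{written}[0]'\n";
```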

Appendix A. Solutions

A.1 Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # replace the extension of the input file (if any) with the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "    -i informat  -- infile format (default 'swiss')\n";
        print "    -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "    -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split("\n", $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment: from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);
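The gap-counting step of this solution can be tested on plain strings, without an alignment object. The sketch below is an illustration only: the function gap_margins and the three aligned rows are invented, and substr plays the role of the slice method.

```perl
use strict;
use warnings;

# Return the widest leading and trailing gap runs over a set of aligned strings.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

my @rows = ('--ACGTAC-', 'TTACGTACG', '-TACG-AC-');
my ($start, $end) = gap_margins('-', @rows);

# Keep only the columns where no sequence starts or ends with a gap.
my $width = length($rows[0]) - $start - $end;
print substr($_, $start, $width), "\n" for @rows;
```

Internal gaps, such as the one in the third row, are kept; only the leading and trailing blocks are cut.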

A.3 Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)
    my $Seq_in = Bio::SeqIO->new(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else, use these three lines instead of a)
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast",
    );

    my $blast_report = $factory->submit_blast($query);

    while (my @rids = $factory->each_rid) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (! ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
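The "first hit seen per databank" logic does not depend on the Blast parser and can be checked on a list of names. In the sketch below the hit names are invented placeholders (not real database entries), assumed to be sorted from best to worst as in a Blast report; this dbname variant also returns 'unknown' when the name has no databank prefix, which the solution above does not do.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|X00001|AAA_TEST".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : 'unknown';
}

# Invented hit names, best hit first.
my @hits = ('sp|X00001|AAA_TEST', 'tr|X00002|BBB_TEST',
            'sp|X00003|CCC_TEST', 'pdb|0XXX|A');

my %done;    # databanks already reported
my @best;
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if $done{$db}++;    # a better hit of this databank was already kept
    push @best, $hit;
}
print "$_\n" for @best;
```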

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast",
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length,
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end,
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
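The padding done by fill_gaps can be written without loops, using the string repetition operator x. This is an alternative sketch, not the course's version; it takes the same three arguments and, as an added assumption, clamps the right-hand padding at zero when the sequence would run past the requested length.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# and the result is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps('ACGT', 3, 10);
print "$padded\n";
print length($padded), " columns\n";
```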

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push(@previous_hitarray, @$oldhitarray_ref);
        push(@previous_hitarray, @$hitarray_ref);
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length,
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length,
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {
        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {
        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {
            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

Build a small bioperl module:

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = (
            sprot    => 'swiss',
            sptrnrdb => 'swiss',
            gb       => 'genbank',
            embl     => 'embl',
        );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as db:id
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new(
            -file   => "golden $access $entry 2> /dev/null |",
            -format => _get_seq_format($db),
        );

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw("$id -- no such entry in database $db");
        }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module must return a true value
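The argument checks of get_Seq can be exercised without golden or bioperl. The sketch below is an illustration only: parse_entry is a hypothetical helper, the database table is a shortened copy of the one in the module, and die with eval replaces the throw method.

```perl
use strict;
use warnings;

my %AVAIL_LOCALDB = (sprot => 'swiss', gb => 'genbank', embl => 'embl');

# Split "db:id", validate both parts, and return (db, id, format).
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

foreach my $entry ('sprot:TAUD_ECOLI', 'TAUD_ECOLI', 'nodb:X') {
    my @parts = eval { parse_entry($entry) };
    print $@ ? "error: $@" : "ok: @parts\n";
}
```

Only the first entry passes; the second has no database part and the third names a database that is not in the table.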

85

Appendix A Solutions

86

  • Bioperl course
    • Chapter 1 General introduction
      • 11 Bioperl documentation
      • 12 General bioperl classes
        • Chapter 2 Sequences
          • 21 The BioSeqIO class
          • 22 Sequence classes
          • 23 Features and Location classes
          • 24 Sequence analysis tools
            • Chapter 3 Alignments
              • 31 AlignIO
              • 32 SimpleAlign
              • 33 Code reading protal2dna
                • Chapter 4 Analysis
                  • 41 Blast
                  • 42 Genscan
                    • Chapter 5 Databases
                      • 51 Database classes
                      • 52 Accessing a local database with golden
                        • Chapter 6 Perl Reminders
                          • 61 UML
                          • 62 Perl reminders to use bioperl modules
                          • 63 Perl reminders for a further advanced understanding of bioperl modules
                            • Appendix A Solutions
                              • A1 Sequences
                              • A2 Alignments
                              • A3 Analysis
                              • A4 Databases
Page 62: Bio Perl

Chapter 6 Perl Reminders

6.2.4 Getopt::Std

• To handle command-line options: man Getopt::Std

• Example:

    use Getopt::Std;
    my %opts = ();
    getopts('i:o:I:O:', \%opts);
    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

• Another example:

    use Getopt::Std;
    getopts('f:plgh');
    if ($opt_f) {
        $format = $opt_f;
    } else {
        $format = "Genbank";
    }
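The option handling above can be tried without any bioperl module. The following is a minimal sketch using only core Perl; the helper name parse_converter_options and the extra -h flag are assumptions made for illustration, not part of the course scripts.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper (not from the course): parse a list of command-line
# arguments and return the converter settings with their defaults applied.
sub parse_converter_options {
    my @args = @_;
    local @ARGV = @args;                 # getopts() works on @ARGV
    my %opts;
    # A letter followed by ':' takes a value; a bare letter is a boolean flag.
    getopts('hi:o:I:O:', \%opts) or die "invalid option\n";
    return {
        help    => $opts{h} ? 1 : 0,
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
    };
}

my $settings = parse_converter_options(qw(-i embl -O seqs.fasta));
print join(' ', map { "$_=$settings->{$_}" } sort keys %$settings), "\n";
```

Passing a hash reference as the second argument keeps the options out of the global $opt_x variables, which is the form that works comfortably under use strict.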

6.2.5 Classes

• Inheritance: example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]

    @ISA = qw(Bio::Root::RootI Bio::SeqI);

• In bioperl, you use classes in the following ways:

• By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh] and from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult]):

    $seq_stats = Bio::Tools::SeqStats->new($seqobj);

• As a library component.

• Test whether a class or an object "is-a":

    if (Bio::Tools::Blast->isa("Bio::Root::RootI")) { ... }

    if ($seq->isa("Bio::Seq::RichSeq")) { ... }

• Find the class of an object:

    print "class of exon: ", ref($exon), "\n";
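As a small illustration of inheritance, isa and ref, here is a sketch that needs no bioperl installation. The packages My::SeqI and My::Seq are hypothetical stand-ins invented for this example; only the @ISA, isa and ref mechanisms are the ones described above.

```perl
use strict;
use warnings;

package My::SeqI;                        # hypothetical interface class
sub describe {
    my $self = shift;
    return ref($self) . ' of length ' . $self->seq_length;
}

package My::Seq;                         # hypothetical implementation class
our @ISA = qw(My::SeqI);                 # inheritance, declared as in Bio::Seq
sub new {
    my ($class, %args) = @_;
    return bless { seq => $args{-seq} || '' }, $class;
}
sub seq_length { my $self = shift; return length($self->{seq}) }

package main;

my $seq = My::Seq->new(-seq => 'ATGGCC');
print "class of seq: ", ref($seq), "\n";                      # class of an object
print "object is-a My::SeqI\n" if $seq->isa('My::SeqI');      # test on an object
print "class is-a My::SeqI\n"  if My::Seq->isa('My::SeqI');   # test on a class
print $seq->describe, "\n";                                   # inherited method
```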

6.2.6 BEGIN block

Statements to be executed when the module is loaded (e.g. when issuing a use Module statement); see for instance Bio::LocationI, which defines the default coordinate policy this way:

    BEGIN {
        $coord_policy = Bio::Location::WidestCoordPolicy->new();
    }

Remark: this is necessary in object-oriented style, since everything is encapsulated in methods.
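To see when a BEGIN block actually runs, the sketch below records the order of execution. The package My::Location and its variables are hypothetical; a plain string stands in for the policy object.

```perl
use strict;
use warnings;

package My::Location;                    # hypothetical class
our ($coord_policy, @events);

# Plain statement: runs only when execution reaches this line.
push @events, 'run-time statement';

# BEGIN block: runs as soon as the compiler has parsed it, that is before
# any run-time statement of the file, even one written above it.
BEGIN {
    $coord_policy = 'widest';
    push @events, 'BEGIN block';
}

sub coordinate_policy { return $coord_policy }

package main;

print join(', ', @My::Location::events), "\n";
print My::Location::coordinate_policy(), "\n";
```

The default is therefore in place before any method of the class can be called.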

6.3 Perl reminders for a further advanced understanding of bioperl modules

6.3.1 Modules

• Exporter [http://www.perldoc.com/perl5.6/lib/Exporter.html] handles the scope of module variables (even though there are no private variables in perl).

• @EXPORT = qw(...): symbols exported by default

• @EXPORT_OK = qw(...): symbols imported on demand

• %EXPORT_TAGS = (tag => [...]): tags for sets of symbols; example:

    use Bio::Tools::Blast qw(:obj);

imports the symbols tagged as obj, here $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
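The three export lists can be sketched in a single file. My::Tools, revcomp, gc_count and $Default are hypothetical names chosen for this example; because the module lives in the same file, the import is done in a BEGIN block, which is what a use My::Tools qw(revcomp :obj) statement would do for a module stored in its own file.

```perl
use strict;
use warnings;

package My::Tools;                       # hypothetical module
use Exporter 'import';
our (@EXPORT, @EXPORT_OK, %EXPORT_TAGS, $Default);

BEGIN {
    @EXPORT      = qw(revcomp);                  # exported by default
    @EXPORT_OK   = qw(gc_count $Default);        # imported on demand
    %EXPORT_TAGS = (obj => [qw($Default)]);      # named set of symbols
    $Default     = 'ACGT';
}

sub revcomp {
    my $s = reverse shift;
    $s =~ tr/ACGTacgt/TGCAtgca/;
    return $s;
}

sub gc_count {
    my $s = shift;
    return ($s =~ tr/GCgc//);            # tr with an empty replacement only counts
}

package main;

# Equivalent of: use My::Tools qw(revcomp :obj);
BEGIN { My::Tools->import(qw(revcomp :obj)) }

print revcomp('AAGC'), "\n";             # imported by name
print "$Default\n";                      # imported through the :obj tag
print My::Tools::gc_count('AAGC'), "\n"; # not imported: needs the full name
```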

6.3.2 Compiler instructions

• use strict; : no symbolic references, no undeclared variables (lexical variables are declared with a my statement; package variables must be pre-declared or fully qualified), no barewords.

• use vars qw(@ISA %valid_type); : a pragma to pre-declare global variables.
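The effect of these two pragmas can be observed by trapping the errors with eval. This is only a sketch; the variable names $undeclared_total and $name are invented for the demonstration.

```perl
use strict;
use warnings;
use vars qw(@ISA %valid_type);           # pre-declared package globals

# Allowed under strict: %valid_type was declared by "use vars".
%valid_type = (dna => 1, rna => 1, protein => 1);

# strict vars: an undeclared variable is rejected when the code is compiled.
my $compiled   = eval q{ $undeclared_total = 1; 1 };
my $vars_error = $compiled ? '' : $@;

# strict refs: a string may not be used as a variable name (symbolic reference).
my $name       = 'valid_type';
my $deref_ok   = eval { my %copy = %$name; 1 };
my $refs_error = $deref_ok ? '' : $@;

print "strict vars: $vars_error";
print "strict refs: $refs_error";
```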

6.3.3 Tie

Associates a class with a variable: whenever a variable is tied, read and write statements (access, assignment) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE, ...). You can find an example in Bio::SeqIO:

    sub fh {
        my $self  = shift;
        my $class = ref($self) || $self;
        my $s     = Symbol::gensym;
        tie $$s, $class, $self;
        return $s;
    }

which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:

    sub READLINE {
        my $self = shift;
        return $self->{'seqio'}->next_seq() unless wantarray;
        my (@list, $obj);
        push @list, $obj while $obj = $self->{'seqio'}->next_seq();
        return @list;
    }

    sub PRINT {
        my $self = shift;
        $self->{'seqio'}->write_seq(@_);
    }

These methods enable the programmer to read and print through the variable:

    $fh = $obj->fh;          # make a tied filehandle
    $sequence = <$fh>;       # read a sequence object (calls READLINE)
    print $fh $sequence;     # write a sequence object (calls PRINT)
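The same mechanism can be reproduced with core Perl only. In the sketch below, My::RecordStream is a hypothetical stand-in for Bio::SeqIO that serves strings from an in-memory list, and My::RecordStream::Fh is the class the handle is tied to; neither name comes from bioperl.

```perl
use strict;
use warnings;
use Symbol ();

package My::RecordStream;                # hypothetical stand-in for Bio::SeqIO
sub new {
    my ($class, @records) = @_;
    return bless { in => [@records], out => [] }, $class;
}
sub next_record  { my $self = shift; return shift @{ $self->{in} } }
sub write_record { my ($self, @records) = @_; push @{ $self->{out} }, @records; return 1 }
sub written      { my $self = shift; return @{ $self->{out} } }
sub fh {
    my $self = shift;
    my $s = Symbol::gensym();            # anonymous glob to carry the handle
    tie *$s, 'My::RecordStream::Fh', $self;
    return $s;
}

package My::RecordStream::Fh;            # class implementing the tied handle
sub TIEHANDLE {
    my ($class, $stream) = @_;
    return bless { stream => $stream }, $class;
}
sub READLINE {
    my $self = shift;
    return $self->{stream}->next_record() unless wantarray;   # scalar context
    my (@list, $obj);
    push @list, $obj while defined($obj = $self->{stream}->next_record());
    return @list;                                             # list context
}
sub PRINT { my $self = shift; return $self->{stream}->write_record(@_) }

package main;

my $stream = My::RecordStream->new(qw(seq1 seq2 seq3));
my $fh     = $stream->fh;
my $first  = <$fh>;                      # calls READLINE in scalar context
my @rest   = <$fh>;                      # calls READLINE in list context
print $fh 'seq4';                        # calls PRINT
print "first=$first rest=@rest written=", join(',', $stream->written), "\n";
```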

Appendix A Solutions

A.1 Sequences

Solution A.1 Bio::SeqIO structure

Exercise 2.1

Figure A.1 Bio::SeqIO structure

Solution A.2 A universal converter

Exercise 2.2

UnivConvert via a file

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file => $infile,     -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'}) || 'swiss';
    my $outform = lc($opts{'o'}) || 'fasta';

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh => \*STDIN,  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3 Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4 More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5 Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2 Alignments

Solution A.6 Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
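The gap computation of this solution does not depend on bioperl and can be checked on plain strings. The sketch below assumes the alignment is given as equal-length rows; the helper name trim_flanking_gaps and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: $rows is a reference to equal-length aligned strings.
# Returns the rows without the leading and trailing columns in which at
# least one row has only gap characters up to (or from) that column.
sub trim_flanking_gaps {
    my ($rows, $gap_char) = @_;
    return () unless @$rows;
    $gap_char = '-' unless defined $gap_char;

    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@$rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;   # run of gaps at the start
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;   # run of gaps at the end
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }

    my $width = length($rows->[0]) - $startgaps - $endgaps;
    $width = 0 if $width < 0;
    return map { substr($_, $startgaps, $width) } @$rows;
}

my @trimmed = trim_flanking_gaps(['--ACGT-', '-AACGTT', 'ACTCG--']);
print "$_\n" for @trimmed;
```

The substr call covers the same columns as slice($startgaps+1, length-$endgaps) in the solution, which counts columns from 1.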

A.3 Analysis

Solution A.7 Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query  = $Seq_in->next_seq();

    # b) else:
    #
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query    = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8 Running Blast: setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9 Running Blast: saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10 Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11 Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12 Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13 Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed the output of blastall directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14 Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # length threshold: hits at least this long are reported (default 200)
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15 Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16 Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank prefix of a hit name, i.e. the word before the first '|'
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;    # databanks whose best hit has already been displayed
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
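The selection logic of this solution (keep only the first hit seen for each databank prefix) can be exercised on plain strings. In the sketch below, first_hit_per_databank is a hypothetical helper and the hit names are made-up identifiers in the db|accession|name style, not real Blast output.

```perl
use strict;
use warnings;

# Databank prefix of a hit name: the word before the first '|', or undef.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Hypothetical helper: keep the first name seen for each databank.
# Blast lists hits from best to worst, so the first one is the best one.
sub first_hit_per_databank {
    my @names = @_;
    my (%done, @best);
    for my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;         # name without a databank prefix
        next if $done{$db}++;            # a better hit was already kept
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, ordered from best to worst.
my @hits = ('sp|X00001|AAA_TEST', 'sp|X00002|BBB_TEST',
            'gb|Y00001|CCC_TEST', 'sp|X00003|DDD_TEST',
            'pir|Z00001|EEE_TEST');
print "$_\n" for first_hit_per_databank(@hits);
```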

Solution A.17 Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;    # only the first (best) hit of each query
        }
    }

Solution A.18 Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19 Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20 Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length
    );

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end
            );
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
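The padding done by fill_gaps is plain string arithmetic and can be tested on its own. The sketch below is an equivalent formulation using the x repetition operator instead of the two loops; it is offered as an illustration, not as the course's implementation, and the sample sequence is invented.

```perl
use strict;
use warnings;

# Pad $seq with gap characters so that it starts at column $start
# (counted from 1) of an alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;                          # gaps before the sequence
    my $trail = $length - $lead - length($seq);      # gaps after the sequence
    $lead  = 0 if $lead  < 0;
    $trail = 0 if $trail < 0;
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

With a start of 3 and a width of 10, the four residues are preceded by two gaps and followed by four, so every padded row has the same length as the contig row.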

Solution A.21 Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in      = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query   = $in->next_seq();
    my @dbs     = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22 Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23 Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24 Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25 Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in earlier iterations
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26 Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1    = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2    = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report  = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length
        );
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4 Databases

Solution A.27

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28 Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    # + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29 Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databank name => sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;    # a module file must end with a true value

Page 63: Bio Perl

Chapter 6 Perl Reminders

bull Test whether a class or an objectis-a

if (BioToolsBlast-gtisa(BioRootRootI))

if ($seq-gtisa(BioSeqRichSeq))

bull find the class of an object

print class of exon ref($exon) n

626 BEGIN block

Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance

BioLocationI to define the default coordinate policy

BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()

Remark this is necessary in object-oriented style since everything is encapsulated in methods

63 Perl reminders for a further advanced understanding ofbioperl modules

631 Modules

bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)

bull EXPORT= qw() symbols exported by default

bull EXPORT_OK = qw() symbols imported on demand

56

Chapter 6 Perl Reminders

bull EXPORT_TAGS = tag =gt [] tags for symbols sets example

use BioToolsBlast qw(obj)

imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods

632 Compiler instructions

bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords

bull use vars qw(ISA valid_type) a pragmato pre-declare global variables

633 Tie

Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)

sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s

which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods

sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list

sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)

These methods enable the programmer to read and print through the variable

57

Chapter 6 Perl Reminders

$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)

58

Appendix A Solutions

Appendix A Solutions

A1 Sequences

Solution A1 BioSeqIO structure

Exercise 21

59

Appendix A Solutions

Figure A1 BioSeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

    #! /local/bin/perl -w
    # exercise 1.1: UnivConvert via file

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # derive the output name from the input name and the output format
        $outfile = $infile;
        $outfile =~ s/\.[^.]*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream:

    #! /local/bin/perl -w
    # exercise 1.1: UnivConvert via stream

    use strict;
    use Bio::SeqIO;

    my $inform  = shift @ARGV || 'swiss';
    my $outform = shift @ARGV || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

    #! /local/bin/perl -w
    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                                  -format => $outform );

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream:

    #! /local/bin/perl -w
    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;
    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                                  -format => $inform );
    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => $outform );

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;   # false
        }
        return 1;       # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split ("\n", $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost the same for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w
    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
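The trimming logic can also be tried without bioperl, on plain strings. This is only a sketch under stated assumptions: trim_gapped_ends is a hypothetical helper, the rows are invented, and all rows are assumed to have the same length, as in a real alignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Remove the leading and trailing columns in which at least one row
# still shows a terminal gap.
sub trim_gapped_ends {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    for my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;   # gaps at the start
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;   # gaps at the end
        $startgaps = max($startgaps, length $lead);
        $endgaps   = max($endgaps,   length $trail);
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    return map { substr($_, $startgaps, $width) } @rows;
}

my @trimmed = trim_gapped_ends('-', '--ACGT-A', 'TTAC-TGA', '-GACGT--');
print "$_\n" for @trimmed;
# ACGT
# AC-T
# ACGT
```

Internal gaps (the one in the second row) are kept; only the ragged ends are removed, which is what slice() does on the alignment object above.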

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new (-file   => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # length threshold: only hits at least this long are displayed
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # extract the databank name: the word before the first '|' of the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;   # databanks for which a hit has already been displayed
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        # display only the first hit of each query
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
                         -seq   => $seq->seq,
                         -id    => $seq->display_id,
                         -start => 1,
                         -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                                -seq   => $seq,
                                -id    => $name,
                                -start => $start,
                                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
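The padding step can be checked on its own. The sketch below is an alternative formulation that uses Perl's repetition operator x instead of the two loops, under the same convention (1-based start, total width $length); the input values are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at column $start (1-based)
# and the resulting string is $length characters wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;                       # gaps before the sequence
    my $trail = $length - $lead - length($seq);   # gaps after the sequence
    $trail = 0 if $trail < 0;
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACG', 3, 8), "\n";   # --ACG---
```

Every padded HSP then has the width of the contig, which is what allows it to be added as a row of the same alignment.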

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(
                          database => $db,
                          program  => 'blastp');

        my $report = $factory->blastall($query);

        # record the HSPs of the first subject only
        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag: ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'  => 'blastp',
                      'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;   # hits seen in the earlier iterations
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->qs,
                           -id    => $query1->display_id,
                           -start => 1,
                           -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
                           -seq   => $hsp->ss,
                           -id    => $report->sbjctName,
                           -start => 1,
                           -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w
    # (a) Genscan: build a database containing one entry per CDS

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w
    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w
    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w
    # Genscan: build a database - 1 entry per gene with genomic DNA
    #                           + 1 feature for the whole CDS
    #                           + 1 feature per exon

    use strict;
    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {
            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry:
        # subsequence of the sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases and the format of their entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # closing the pipe sets $? to the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;

Page 64: Bio Perl

Chapter 6 Perl Reminders

bull EXPORT_TAGS = tag =gt [] tags for symbols sets example

use BioToolsBlast qw(obj)

imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods

632 Compiler instructions

bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords

bull use vars qw(ISA valid_type) a pragmato pre-declare global variables

633 Tie

Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)

sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s

which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods

sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list

sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)

These methods enable the programmer to read and print through the variable

57

Chapter 6 Perl Reminders

$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)

58

Appendix A Solutions

Appendix A Solutions

A1 Sequences

Solution A1 BioSeqIO structure

Exercise 21

59

Appendix A Solutions

Figure A1 BioSeqIO structure

Solution A2 An universal converter

Exercise 22

UnivConvert via a file

60

Appendix A Solutions

localbinperl -w

exercise 11 UnivConvert via file

use strict

use BioSeqIO

my $inform = shift ARGVmy $outform = shift ARGV

my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )

$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform

my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

perl UnivConvert2pl swiss gcg seqssp

UnivConvert via stream

localbinperl -w

exercise 11 UnivConvert via flux

use strict

use BioSeqIO

my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo

my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT

61

Appendix A Solutions

-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

cat seqssp | perl solutionUnivConvertStdminpl swiss gcg

UnivConvert with options

localbinperl -w

exercice 11 UnivConvert avec options

use strictuse GetoptStd

use BioSeqIO

my opts = ()getopts(rsquoioIOrsquo opts)

my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo

my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

perl solutionUnivConvertpl -I seqssp -O seqsfasta

UnivConvert with options and stream

localbinperl -w

62

Appendix A Solutions

exercise 11 UnivConvert with option and stream

use strictuse GetoptStd

use BioSeqIO

my opts = ()getopts(rsquohiorsquo opts)

usage() if (exists $optsrsquohrsquo)

my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo

(format_ok($inform) ampamp format_ok($outform)) || exit 1

my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )

print $out $_ while lt$ingt

sub format_ok

my $form = shift _

my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)

if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false

return 1 true

sub usage

my $prog =lsquobasename $0lsquo chomp($prog)

print nprint $prog convert files from one format into another formatnprint n

63

Appendix A Solutions

print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn

print n

exit 1

You can use it like this

cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg

Solution A3 Display a sequence in fasta format

Exercise 23

rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()

my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )

if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())

Solution A4 More on annotations

Exercise 24

get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )

my lines = split (n $comment-gttext())foreach my $line (lines)

if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line

64

Appendix A Solutions

last

last if $function ne

print nFunction $function n

Solution A5 Transmembran helices

Exercise 25

almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()

foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)

print TM Segment from $feat-gtstart() to $feat-gtend() n

A2 Alignments

Solution A6 Create an alignment without gaps

Exercise 31

localbinperl -w

Create a new alignment without gaps at the begining and the end of an alignment

use strictuse BioAlignIO

get alignment

my $inif ( $ARGV gt -1 )

$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else

$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )

65

Appendix A Solutions

my $aln = $in-gtnext_aln()

find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)

my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len

cut the starting and ending block containing gaps

print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines of solution a)
# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new(
    '-prog'     => 'blastp',
    '-data'     => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
# despite its name, $max_length is the minimum length a hit must have to be reported
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        # $max_end == -1 means "no upper bound"
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# return the databank prefix of a hit name, i.e. the word before the first '|'
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done = ();    # databanks for which a hit has already been displayed
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
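The "first hit per databank" logic does not depend on Bioperl and can be checked on plain strings. This is a minimal sketch: the hit names are invented, `best_hit_per_db` is a hypothetical helper, and it assumes (as the solution does) that hits arrive best first and that names look like `db|accession|id`.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|X00001|FIRST_HIT".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first (best) hit seen for each databank.
sub best_hit_per_db {
    my @names = @_;
    my (%done, @best);
    for my $name (@names) {
        my $db = dbname($name);
        next if !defined $db || $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Invented hit names, in report order (best first).
my @hits = ('sp|X00001|FIRST_HIT', 'tr|X00002|SECOND_HIT', 'sp|X00003|THIRD_HIT');
print "$_\n" for best_hit_per_db(@hits);
```

Unlike the course solution, this version skips names that contain no `|` instead of using an undefined key.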

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

# blastall accepts a reference to an array of sequences
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;    # only display the first hit of each result
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length
);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end
        );
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
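The padding done by fill_gaps can be tested on its own. The sketch below is an alternative formulation, not the course's code: it uses Perl's repetition operator `x` in place of the two loops, and the sequence and coordinates are made up. It should produce the same string as the loop version whenever the padded sequence fits in the given length.

```perl
use strict;
use warnings;

# Pad $seq with gaps so that it starts at 1-based column $start
# and the whole string is $length columns long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead  = $start - 1;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;    # sequence runs past the end: no trailing gaps
    return ('-' x $lead) . $seq . ('-' x $trail);
}

# Made-up example: a 4-residue match placed at column 3 of a 10-column alignment.
my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
print length($row), "\n";
```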

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;    # only record the HSPs of the first hit
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

my @previous_hitarray = ();    # hits seen in the rounds already processed
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }

    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
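The comparison between successive iterations can be illustrated on plain lists of names. This is a simplified sketch, not the course's method: it compares the complete hit list of each round with that of the previous round by means of a hash, where the solution tests each previous hit against the round's old hits with grep. The hit names and the `disappeared` helper are hypothetical.

```perl
use strict;
use warnings;

# Return the names present in the previous round but absent from the current one.
sub disappeared {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$current;
    return grep { !$seen{$_} } @$previous;
}

# Invented hit names for three successive iterations.
my @rounds = ( [qw(hitA hitB hitC)], [qw(hitA hitC hitD)], [qw(hitD)] );

my @previous = ();
for my $i (0 .. $#rounds) {
    my @gone = disappeared(\@previous, $rounds[$i]);
    print "iteration ", $i + 1, ": disappeared = (@gone)\n";
    @previous = @{ $rounds[$i] };    # only the last round is remembered
}
```

Resetting `@previous` at each round means a hit is reported as disappeared only once, at the iteration where it drops out.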


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

# build one alignment per HSP
while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length
    );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this database\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift the exon coordinates so that they are relative to the subsequence
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if ( ! $nocds) {
        # construct a SeqFeature CDS for the entry ( join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with the format of their entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # $entry has the form "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available.");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # closing the pipe sets $?, whose high byte is the exit status of golden
    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db."); }

    $self->throw ("$id -- no such entry in database $db.") if ( ! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;
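The module relies on reading a program's output through a pipe and then inspecting `$? >> 8` to see whether the program failed. Here is a minimal sketch of that mechanism in core Perl only. It makes two assumptions that are not part of the course: the running perl binary (`$^X`) stands in for the external golden program, and `read_pipe` is a hypothetical helper.

```perl
use strict;
use warnings;

# Run a command through a pipe; return its exit status and its output lines.
sub read_pipe {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd) or die "cannot start @cmd: $!";
    my @lines = <$pipe>;
    close($pipe);              # close() on a pipe waits for the child and sets $?
    return ($? >> 8, @lines);  # the high byte of $? is the exit status
}

# A stand-in child process that prints one line, then exits with status 2.
my ($status, @lines) = read_pipe($^X, '-e', 'print "entry\n"; exit 2');
print "exit status: $status, lines read: ", scalar(@lines), "\n";
```

The status is only meaningful after the pipe has been closed, which is why the module destroys the SeqIO object before testing `$?`.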

Page 65: Bio Perl

Chapter 6 Perl Reminders

$fh = $obj->fh;          # make a tied filehandle
$sequence = <$fh>;       # read a sequence object (calls READLINE)
print $fh $sequence;     # write a sequence object (calls PRINT)
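To see how a tied filehandle routes `<$fh>` and `print $fh` to method calls, here is a small self-contained sketch in core Perl. `StringStream` is a hypothetical class written for this illustration; it is not a Bioperl class and it handles plain strings, not sequence objects.

```perl
use strict;
use warnings;

# Hypothetical class: a tied handle that yields and collects plain strings.
package StringStream;

sub TIEHANDLE {
    my ($class, @items) = @_;
    return bless { in => [@items], out => [] }, $class;
}

# Called for <$fh>: hand back the next item, or undef when exhausted.
sub READLINE {
    my ($self) = @_;
    return shift @{ $self->{in} };
}

# Called for print $fh ...: store what was printed.
sub PRINT {
    my ($self, @args) = @_;
    push @{ $self->{out} }, @args;
    return 1;
}

package main;

use Symbol qw(gensym);

my $fh  = gensym();                                   # anonymous glob to tie
my $obj = tie *$fh, 'StringStream', 'seq1', 'seq2';   # tie returns the object

while (defined(my $item = <$fh>)) {    # each read calls READLINE
    print {$fh} uc($item);             # each print calls PRINT
}
print join(',', @{ $obj->{out} }), "\n";
```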


Appendix A. Solutions

A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    # no output file given: replace the extension of the input file
    # by the name of the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file => $infile, -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp
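The way the default output file name is derived can be checked separately. The sketch below is a variant, not the script's exact substitution: `default_outfile` is a hypothetical helper that removes only the last extension of the file name itself, so that dots in directory names are left alone.

```perl
use strict;
use warnings;

# Build an output file name from the input name and the output format.
sub default_outfile {
    my ($infile, $outform) = @_;
    (my $base = $infile) =~ s/\.[^.\/]*$//;   # drop a trailing ".ext", if any
    return "$base.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";
print default_outfile('seqs', 'fasta'), "\n";
print default_outfile('dir.v2/seqs', 'gcg'), "\n";
```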

UnivConvert via stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV || 'swiss';
my $outform = shift @ARGV || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN, -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file => $infile, -format => $inform );

my $out = Bio::SeqIO->newFh ( -file => ">$outfile", -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta
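Option handling with defaults can be exercised without any sequence file. In this sketch `parse_options` is a hypothetical wrapper around `getopts` from Getopt::Std: it localizes `@ARGV` so that a list of arguments can be parsed from within a program, and it fills in the same defaults as the script.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse converter-style options from a list of arguments and apply defaults.
sub parse_options {
    local @ARGV = @_;    # getopts reads the global @ARGV
    my %opts;
    getopts('i:o:I:O:', \%opts) or die "bad options\n";
    return (
        infile  => $opts{I} || 'infile',
        outfile => $opts{O} || 'outfile',
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
    );
}

my %conf = parse_options('-I', 'seqs.sp', '-o', 'gcg');
print "$_=$conf{$_}\n" for sort keys %conf;
```

The colon after each letter in `'i:o:I:O:'` tells getopts that the option takes a value.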

UnivConvert with options and stream:

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh => \*STDIN, -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => $outform );

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if ( ! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0;    # false
    }

    return 1;    # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A3 Display a sequence in fasta format

Exercise 23

rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()

my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )

if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())

Solution A4 More on annotations

Exercise 24

get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )

my lines = split (n $comment-gttext())foreach my $line (lines)

if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line

64

Appendix A Solutions

last

last if $function ne

print nFunction $function n

Solution A5 Transmembran helices

Exercise 25

almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()

foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)

print TM Segment from $feat-gtstart() to $feat-gtend() n

A2 Alignments

Solution A6 Create an alignment without gaps

Exercise 31

localbinperl -w

Create a new alignment without gaps at the begining and the end of an alignment

use strictuse BioAlignIO

get alignment

my $inif ( $ARGV gt -1 )

$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else

$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )

65

Appendix A Solutions

my $aln = $in-gtnext_aln()

find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)

my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len

cut the starting and ending block containing gaps

print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)

A3 Analysis

Solution A7 Running Blast on a Swissprot entry

Exercise 41

use BioSeqIOuse BioToolsRunStandAloneBlast

a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)

my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()

b) else

use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)

my $factory = BioToolsRunStandAloneBlast-gtnew(

66

Appendix A Solutions

rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A8 Running Blast Setting parameters

Exercise 42

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A9 Running Blast Saving output

Exercise 43

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo

67

Appendix A Solutions

rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast

)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A10 Running a remote Blast

Exercise 44

use BioSeqIOuse BioToolsRunRemoteBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtsubmit_blast($query)

while ( my rids = $factory-gteach_rid )

print STDERR waitingn

RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )

my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )

if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)

retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5

else

---- Blast done ----$factory-gtremove_rid($rid)

68

Appendix A Solutions

my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )

print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )

print score is $hsp-gtscore n

Solution A11 Display Blast hits

Exercise 45

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())

print length $hsp-gtlength() n

Solution A12 Class of a Blast report

Exercise 46

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

69

Appendix A Solutions

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n

Solution A13 Parse Blast results on standard input

Exercise 48

use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-fhrsquo =gt STDIN)

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

One way to use this code is by feeding ablastall output directly to the script through a Unix pipe

blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl

Solution A14 Filtering hits by length

Exercise 49

use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())

70

Appendix A Solutions

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A15 Filtering hits by position

Exercise 410

use BioSearchIO

my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# return the databank prefix of a hit name (the word before the first '|')
sub dbname {
    my ($name) = @_;
    my $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

my %done;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
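The "first hit seen per databank" logic of this solution can be tried without a Blast report. The following is a minimal sketch in core Perl only; the hit names are hypothetical and are assumed to look like "db|accession|entry", and best_by_db is an illustrative helper that is not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical hit names, assumed to follow the "db|accession|entry" layout.
my @hit_names = (
    'sp|X00001|AAAA_ECOLI',
    'sp|X00002|BBBB_ECOLI',
    'tr|X00003|CCCC_ECOLI',
    'sp|X00004|DDDD_ECOLI',
);

# Databank prefix: the word before the first '|', or undef if there is none.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Keep only the first name seen for each databank. Blast lists hits from
# best to worst, so the first one seen is the best one for that databank.
sub best_by_db {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

print "$_\n" for best_by_db(@hit_names);
```

Using a hash as a "seen" set keeps the filter to a single pass over the hits.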

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
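The gap padding done by fill_gaps can be checked on its own, without bioperl. The sketch below is an assumption-labelled variant that uses Perl's string repetition operator instead of the two loops; the sequence 'ACGT' and the alignment length 10 are made-up values for illustration.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based) of an
# alignment that is $length columns wide. Same result as the loop version
# for valid input; negative counts are clamped to zero.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead = $start - 1;
    $lead = 0 if $lead < 0;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;
    return ('-' x $lead) . $seq . ('-' x $trail);
}

# Hypothetical HSP: 4 residues aligned from column 3 of a 10-column alignment.
my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

Every padded row ends up with the same length as the contig row, which is what lets all the rows be added to one Bio::SimpleAlign object.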

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        # only the first (best) subject of each database is recorded
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# hits seen in the previous iterations
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::Seq::RichSeq;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction (the field between the '|' separators)
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1; # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
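The entry name used for the lookup comes from the identifier of the predicted CDS. As a rough sketch in core Perl, and assuming the raw identifier has three '|'-separated fields (the value 'contig1|GENSCAN_predicted_CDS_3|657_bp' below is a hypothetical example, and both helper functions are illustrative, not bioperl calls), the name and its number can be pulled out like this:

```perl
use strict;
use warnings;

# Return the second '|'-separated field of a raw id, or the id itself
# when it has no such field. The three-field layout is an assumption.
sub cds_entry_name {
    my ($raw_id) = @_;
    my @fields = split /\|/, $raw_id;
    return @fields >= 2 ? $fields[1] : $raw_id;
}

# Return the xxx part of GENSCAN_predicted_CDS_xxx, or undef otherwise.
sub cds_number {
    my ($name) = @_;
    return $name =~ /^GENSCAN_predicted_CDS_(\d+)$/ ? $1 : undef;
}

my $raw  = 'contig1|GENSCAN_predicted_CDS_3|657_bp';   # hypothetical id
my $name = cds_entry_name($raw);
print "entry name: $name, prediction number: ", cds_number($name), "\n";
```

Checking the name against the pattern before calling fetch gives a clearer error than a failed index lookup.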

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +/- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        # the entry starts delta bases before the first exon
        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift exon coordinates into the coordinates of the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation',
                                $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan, +/- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databases known to golden, with their sequence format
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    # entries are given as database:identifier
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    # read the entry from the output of the golden program
    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value to be loadable
1;

Page 66: Bio Perl

Appendix A Solutions


A.1. Sequences

Solution A.1. Bio::SeqIO structure

Exercise 2.1

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;

use Bio::SeqIO;

my $inform = shift @ARGV;
my $outform = shift @ARGV;

my $infile = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile ) {
    # default output name: input name with its extension replaced
    # by the output format
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;

use Bio::SeqIO;

my $inform  = (shift @ARGV) || 'swiss';
my $outform = (shift @ARGV) || 'fasta';

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with option and stream

use strict;
use Getopt::Std;

use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

my $inform  = lc($opts{'i'}) || 'swiss';
my $outform = lc($opts{'o'}) || 'fasta';

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {

    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {

    my $prog = `basename $0`; chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "   -i informat  -- infile format (default 'swiss')\n";
    print "   -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "   -h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
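The option handling of this last variant can be exercised without bioperl or any sequence file. The following is a minimal sketch using only core Perl (it needs Perl 5.10 or later for the // operator); parse_formats and the %KNOWN table are illustrative names introduced here, and the format list is a shortened copy of the one in format_ok.

```perl
use strict;
use warnings;
use Getopt::Std;

# Shortened, illustrative table of accepted format names.
my %KNOWN = map { $_ => 1 } qw(fasta swiss embl genbank scf pir gcg raw ace);

# Parse -i/-o from the given argument list and return (informat, outformat),
# or an empty list if an option is invalid or a format is unknown.
sub parse_formats {
    my @args = @_;
    local @ARGV = @args;            # getopts() reads and consumes @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or return;
    my $in  = lc($opts{i} // 'swiss');
    my $out = lc($opts{o} // 'fasta');
    for my $format ($in, $out) {
        return unless $KNOWN{$format};
    }
    return ($in, $out);
}

my ($in, $out) = parse_formats('-o', 'GCG');
print "in=$in out=$out\n";
```

Keeping the parsing in a function that takes its arguments explicitly makes the defaults and the validation easy to check before any file is opened.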

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}

print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    $str =~ /(-*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
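The computation of the slice boundaries can be followed on plain strings. The sketch below uses core Perl only; the three rows are made-up aligned sequences, trim_window is an illustrative helper, and (unlike the solution above, which hardcodes '-' for the trailing gaps) it uses the same gap character at both ends.

```perl
use strict;
use warnings;

# Given a gap character and equal-length alignment rows, return the 1-based
# first and last columns that remain once the longest leading and trailing
# gap runs have been cut off.
sub trim_window {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps + 1, length($rows[0]) - $endgaps);
}

# Hypothetical alignment rows, all 8 columns wide.
my @rows = ('--ACGTAC', 'TTACGT--', '-TACGTA-');
my ($from, $to) = trim_window('-', @rows);
print "keep columns $from to $to\n";
print substr($_, $from - 1, $to - $from + 1), "\n" for @rows;
```

The two returned values correspond to the arguments passed to slice in the solution.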

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (use these lines instead of the two lines of a)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast"
                                                    );
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {

    print STDERR "waiting...\n";

    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {

        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is ", ref($blast_report), "\n";


A4 Databases

78

Appendix A Solutions

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some entries

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new(-filename => $Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
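Entry names of this form are obtained in the solutions by cutting the Genscan identifier at the `|` separators. A minimal sketch of that step in plain Perl follows; the sample identifier and the helper name `entry_name` are assumptions made for illustration, not output copied from a Genscan report.

```perl
use strict;
use warnings;

# Return the second '|'-separated field of an identifier,
# or the identifier itself when it contains no separator.
sub entry_name {
    my ($id) = @_;
    my @fields = split(/\|/, $id);
    return @fields > 1 ? $fields[1] : $id;
}

# Hypothetical identifier in "prefix|name|length" style.
print entry_name('seq1|GENSCAN_predicted_CDS_3|1245_bp'), "\n";
```

Splitting on the separator, rather than matching with a greedy pattern, keeps the result predictable when the prefix itself could contain further `|` characters.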

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            # the entry starts $delta bases before the first exon
            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates so they are relative to the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end   - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
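The coordinate arithmetic in Solution A.28 (pad the gene by delta on each side, clamp to the sequence, then express exon positions relative to the new entry) is easier to check in isolation. The following sketch uses plain numbers only; the helper `padded_window` is hypothetical, and it assumes 1-based inclusive coordinates, which differs slightly from the solution's clamping of the start to 0.

```perl
use strict;
use warnings;

# Clamp a window of +/- $delta around [$first_start, $last_end] to
# 1..$seq_len (1-based, inclusive). Also return the offset to subtract
# from genomic coordinates to obtain window-relative ones.
sub padded_window {
    my ($first_start, $last_end, $delta, $seq_len) = @_;
    my $start = $first_start - $delta;
    $start = 1 if $start < 1;
    my $end = $last_end + $delta;
    $end = $seq_len if $end > $seq_len;
    return ($start, $end, $start - 1);
}

my ($start, $end, $offset) = padded_window(250, 900, 100, 950);
print "window $start..$end, exon start 250 becomes ", 250 - $offset, "\n";
```

With these numbers the window is 150..950 and the first exon starts at position 101 of the extracted subsequence, that is, right after the 100 bases of padding.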

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => flat-file format understood by Bio::SeqIO
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # $entry has the form "database:identifier"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # close the pipe, then check the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;
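The input checks that `get_Seq` performs before calling golden can be exercised on their own. The sketch below reuses the database table from Solution A.29 but is otherwise an assumption-laden illustration: `parse_entry` is a hypothetical helper, and it reports problems with plain `die` instead of the bioperl `throw` method.

```perl
use strict;
use warnings;

# database name => flat-file format (same table as in the module)
my %AVAIL_LOCALDB = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split "db:id" and look up the flat-file format.
# Dies on a missing database part or an unknown database.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(/:/, $entry, 2);
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";
```

Validating the database name before building the shell command also keeps unexpected strings out of the pipe that is opened on golden.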

Page 67: Bio Perl

Appendix A Solutions

Figure A.1. Bio::SeqIO structure

Solution A.2. A universal converter

Exercise 2.2

UnivConvert via a file

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via file

    use strict;

    use Bio::SeqIO;

    my $inform  = shift @ARGV;
    my $outform = shift @ARGV;

    my $infile  = shift @ARGV;
    my $outfile = shift @ARGV;
    if (! defined $outfile) {
        # default: input file name with the output format as extension
        $outfile = $infile;
        $outfile =~ s/\..*$// if $outfile =~ /\./;
        $outfile .= ".$outform";
    }

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl UnivConvert2.pl swiss gcg seqs.sp
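When no output file is given, the converter derives one from the input file name and the output format. A minimal sketch of that naming rule in plain Perl is shown here; the helper name `default_outfile` is hypothetical.

```perl
use strict;
use warnings;

# Derive an output file name: drop everything from the first dot on,
# then append the output format as the new extension.
sub default_outfile {
    my ($infile, $outform) = @_;
    my $outfile = $infile;
    $outfile =~ s/\..*$//;
    return "$outfile.$outform";
}

print default_outfile('seqs.sp', 'gcg'), "\n";    # prints seqs.gcg
```

Note that the substitution cuts at the first dot of the whole string, so a path whose directory name contains a dot would be truncated too early; a stricter version could use `File::Basename`.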

UnivConvert via stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert via stream

    use strict;

    use Bio::SeqIO;

    my $inform  = shift(@ARGV) || 'swiss';
    my $outform = shift(@ARGV) || 'fasta';

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    # every option takes a value, hence the colons
    my %opts = ();
    getopts('i:o:I:O:', \%opts);

    my $infile  = $opts{'I'} || 'infile';
    my $outfile = $opts{'O'} || 'outfile';
    my $inform  = $opts{'i'} || 'swiss';
    my $outform = $opts{'o'} || 'fasta';

    my $in  = Bio::SeqIO->newFh(-file => $infile,     -format => $inform);
    my $out = Bio::SeqIO->newFh(-file => ">$outfile", -format => $outform);

    print $out $_ while <$in>;

You can use it like this:

    perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

    #!/local/bin/perl -w

    # exercise 1.1: UnivConvert with options and stream

    use strict;
    use Getopt::Std;

    use Bio::SeqIO;

    my %opts = ();
    getopts('hi:o:', \%opts);

    usage() if (exists $opts{'h'});

    my $inform  = lc($opts{'i'} || 'swiss');
    my $outform = lc($opts{'o'} || 'fasta');

    (format_ok($inform) && format_ok($outform)) || exit 1;

    my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $inform);
    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $outform);

    print $out $_ while <$in>;

    sub format_ok {
        my $form = shift @_;

        my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                         scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                         phd 1 qual 1);

        if (! exists $formats{$form}) {
            print STDERR "$form -- unknown format\n";
            usage();
            return 0;    # false
        }
        return 1;        # true
    }

    sub usage {
        my $prog = `basename $0`;
        chomp($prog);

        print "\n";
        print "$prog: convert files from one format into another format\n";
        print "\n";
        print "$prog [-i informat] [-o outformat] < infile\n";
        print "\n";
        print "  -i informat  -- infile format (default 'swiss')\n";
        print "  -o outformat -- outfile format (default 'fasta')\n";
        print "\n";
        print "  -h           -- print this message\n";
        print "\n";
        print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
        print "\n";

        exit 1;
    }

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
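The option handling of this converter relies on the core module `Getopt::Std`: in the option string, a letter followed by a colon takes a value and a bare letter is a flag. The following sketch isolates that part so it can be run without bioperl; the wrapper `parse_formats` and its return structure are assumptions made for the example.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse -i/-o (each taking a value) and -h (a flag) from an argument list.
# Returns a hash reference, or nothing if an unknown option was given.
sub parse_formats {
    local @ARGV = @_;    # getopts() always works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or return;
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => lc($opts{i} || 'swiss'),
        outform => lc($opts{o} || 'fasta'),
    };
}

my $conf = parse_formats('-o', 'GCG');
print "in=$conf->{inform} out=$conf->{outform}\n";    # prints in=swiss out=gcg
```

Applying the default before calling `lc` avoids an "uninitialized value" warning under `-w` when an option is absent.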

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    # collect the identifiers of all cross-references to PDB
    my @structures = ();
    foreach my $link ($annotation->each_DBLink()) {
        if ($link->database() eq 'PDB') {
            push(@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function
    my $function = "";
    foreach my $comment ($annotation->get_Annotations('comment')) {
        my @lines = split(/\n/, $comment->text());
        foreach my $line (@lines) {
            if ($line =~ s/^-!- FUNCTION: //) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }

    print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(),
                  " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end
    # of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ($#ARGV > -1) {
        $in = new Bio::AlignIO(-file => $ARGV[0], -format => 'clustalw');
    } else {
        $in = new Bio::AlignIO(-fh => \*STDIN, -format => 'clustalw');
    }
    my $out = newFh Bio::AlignIO(-fh => \*STDOUT, -format => 'clustalw');

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of
    # ending gaps
    my $startgaps = 0;
    my $endgaps   = 0;
    my $gap_char  = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(\Q$gap_char\E*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps + 1, $aln->length() - $endgaps);
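The trimming rule used here (take the longest run of leading gaps and the longest run of trailing gaps over all rows, then cut those columns from every row) can be checked on plain strings. This is a sketch under the assumption that all rows have the same length; `trim_gap_columns` is a hypothetical helper and does not use Bio::SimpleAlign.

```perl
use strict;
use warnings;

# Given aligned rows of equal length, remove the leading and trailing
# columns in which at least one row still starts or ends with gaps.
sub trim_gap_columns {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    my $width = length($rows[0]) - $startgaps - $endgaps;
    return map { substr($_, $startgaps, $width) } @rows;
}

my @trimmed = trim_gap_columns('-', '--ACGT-A', 'TTACGTTA', '-TAC-T--');
print "$_\n" for @trimmed;
```

For the three sample rows, two columns are removed on each side, leaving `ACGT`, `ACGT` and `AC-T`; internal gaps are kept.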

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # use either a) or b) to get the query, not both

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new(-file   => 'golden sp:TAUD_ECOLI |',
                                 -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else
    use Bio::DB::SwissProt;
    my $database = new Bio::DB::SwissProt;
    my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    # set the expectation value (E) cutoff before running blastall
    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    # keep the raw Blast output in a file
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
                      '-prog'     => 'blastp',
                      '-data'     => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    # poll until every submitted job has been retrieved or has failed
    while (my @rids = $factory->each_rid) {

        print STDERR "waiting...\n";

        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid (@rids) {

            my $rc = $factory->retrieve_blast($rid);
            if (! ref($rc)) {
                if ($rc < 0) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);

                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while (my $hit = $result->next_hit) {
                    print "hit name is: ", $hit->name, "\n";
                    while (my $hsp = $hit->next_hsp) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    # length threshold: only hits at least this long are displayed
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(),
                  " hit length: ", $hit->length(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    # -1 means "no upper limit"
    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(),
                      " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
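The position test combines a lower bound with an optional upper bound, where -1 stands for "no limit". A small sketch of that predicate in plain Perl follows; the function name `in_window` and the sample coordinates are assumptions for illustration.

```perl
use strict;
use warnings;

# True (1) if [$start, $end] begins at or after $min_start and, when a
# maximum is set, ends at or before $max_end. A false or missing
# $min_start defaults to 1, a false or missing $max_end to -1 (no limit).
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    $min_start = 1  unless $min_start;
    $max_end   = -1 unless $max_end;
    return ($start >= $min_start
            && ($max_end == -1 || $end <= $max_end)) ? 1 : 0;
}

print in_window(5, 20, 1, 15) ? "kept\n" : "filtered out\n";
```

Keeping the test in one function makes it straightforward to apply the same filter to hit coordinates or query coordinates.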

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank prefix of a hit name, i.e. the word before the first '|'
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        if ($name =~ /(\w+)\|/) {
            $db = $1;
        }
        return $db;
    }

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    # databanks for which the best hit has already been displayed
    my %done = ();

    while (my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while (my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
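Since a Blast report lists hits from best to worst, keeping the first hit seen for each databank prefix yields the best hit per databank. The sketch below shows the "seen" hash idiom on a plain list; the hit names are invented for the example, and `best_hit_per_db` is a hypothetical helper.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|P12345|NAME".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : $name;
}

# Keep only the first (best-ranked) hit seen for each databank prefix.
sub best_hit_per_db {
    my %done;
    return grep { !$done{ dbname($_) }++ } @_;
}

# Hypothetical hit names, in the order a report would list them.
my @hits = ('sp|P11111|AAA_ECOLI', 'sp|P22222|BBB_ECOLI',
            'tr|Q33333|Q33333_HUMAN');
print "$_\n" for best_hit_per_db(@hits);
```

The post-increment inside `grep` returns the old count, so the test is true only the first time a prefix is met.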

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    push(@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
    push(@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
                      'program'   => 'blastp',
                      'database'  => 'swissprot',
                      _READMETHOD => "Blast");

    # blastall takes a reference to the array of query sequences
    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        # display only the first hit of each result
        while (my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(),
                  " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level  = $ARGV[1];

    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(-seq   => $seq->seq,
                                            -id    => $seq->display_id,
                                            -start => 1,
                                            -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(-seq   => $seq,
                                                   -id    => $name,
                                                   -start => $start,
                                                   -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
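The `fill_gaps` step pads each HSP with gap characters so that it occupies its position within the full contig length. A compact variant using the string repetition operator is sketched below; it is meant as an equivalent illustration rather than the course's own code, and the guard against a negative right padding is an added assumption.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at 1-based column $start and the
# result is $length characters long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the end: no padding
    return ('-' x $left) . $seq . ('-' x $right);
}

print fill_gaps('ACGT', 3, 10), "\n";    # prints --ACGT----
```

Every padded row then has the same length as the contig, which is what allows it to be added to the same alignment object.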

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in    = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs   = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(
                          database => $db,
                          program  => 'blastp');

        my $report = $factory->blastall($query);

        # record the HSPs of the first subject only
        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A23 Create a BioToolsBPlite from a file

Exercise 418

use BioToolsBPlite

my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

76

Appendix A Solutions

Solution A24 Create a BioToolsBPlite from standard input

Exercise 419

use BioToolsBPlite

my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

Solution A25 Parse a PSI-blast report

Exercise 420

use BioToolsBPpsilite

my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])

for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )

print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)

if (grep Q$hitE $oldhitarray_ref ) next

print DISAPEARED hit name $hitn

push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )

77

Appendix A Solutions

Solution A26 Build a BioSimpleAlign object

Exercise 421

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )

my $query1 = lt$query1_ingt

my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )

my $query2 = lt$query2_ingt

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )

$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)

$report = $factory-gtbl2seq($query1 $query2)

while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength

)my $sbjctSeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)

$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln

A4 Databases

78

Appendix A Solutions

Solution A27

Exercise 51

bull (a) database construction

localbinperl -w

(a) Genscan build a database containing one entry per CDS

use strict

use BioToolsGenscanuse BioSeqIO

my $genscan_file = shift ARGV

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

date of constructionmy $date = gmtime

while(my $gene = $genscan-gtnext_prediction())

create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)

extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)

annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())

$entry-gtadd_SeqFeature($exon)

79

Appendix A Solutions

add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())

$entry-gtadd_SeqFeature($utr)

print the entry to the database flatfile in embl formatprint $banque $entry

You can use this script like this

perl solutionbuild_dbpl datagenscan-arabout gt mydbembl

bull (b) make index

usrbinperl -w

(b) Genscan create the index

use strictuse BioIndexEMBL

my $Index_File_Name = shift

my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)

$inx-gtmake_index(ARGV)

You can use this script like this

perl solutionmake_indexpl mydbidx mydbembl

80

Appendix A Solutions

bull (c) use index

usrbinperl -w

(c) Genscan retrieve some files

use strictuse BioIndexEMBLuse BioSeqIO

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)

my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)

foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen

You can use this script like this

perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3

since entries names follows the pattern GENSCAN_predicted_CDS_xxx

Solution A28 Parse a Genscan report and build a database entry with the genomicsequence

Exercise 52

localbinperl -w

Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon

use strict

use BioToolsGenscanuse BioSeqIO

81

Appendix A Solutions

my $genscan_file = shift ARGV

my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)

sequence submitted to genscanmy $seq = $seqin-gtnext_seq()

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime

print $banque $seq

while(my $gene = $genscan-gtnext_prediction())

create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)

extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())

annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())

$loc = $exon-gtlocation()

if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0

Page 68: Bio Perl

Appendix A Solutions

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via file

use strict;
use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;

my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if (! defined $outfile ) {
    # derive the output name from the input name: replace the extension
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= ".$outform";
}

my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                              -format => $inform );
my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl UnivConvert2.pl swiss gcg seqs.sp

UnivConvert via stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert via stream

use strict;
use Bio::SeqIO;

my $inform  = shift(@ARGV) || 'swiss';
my $outform = shift(@ARGV) || 'fasta';

my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                              -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStdmin.pl swiss gcg

UnivConvert with options

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('i:o:I:O:', \%opts);

my $infile  = $opts{'I'} || 'infile';
my $outfile = $opts{'O'} || 'outfile';
my $inform  = $opts{'i'} || 'swiss';
my $outform = $opts{'o'} || 'fasta';

my $in  = Bio::SeqIO->newFh ( -file   => $infile,
                              -format => $inform );
my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );

print $out $_ while <$in>;

You can use it like this:

perl solution/UnivConvert.pl -I seqs.sp -O seqs.fasta

UnivConvert with options and stream

#! /local/bin/perl -w

# exercise 1.1: UnivConvert with options and stream

use strict;
use Getopt::Std;
use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);

usage() if (exists $opts{'h'});

# apply the default before lc() so that -w does not warn on a missing option
my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

(format_ok($inform) && format_ok($outform)) || exit 1;

my $in  = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                              -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );

print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;

    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1
                     scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1
                     phd 1 qual 1);

    if (! exists $formats{$form}) {
        print STDERR "$form -- unknown format\n";
        usage();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = `basename $0`;
    chomp($prog);

    print "\n";
    print "$prog: convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "  -i informat  -- infile format (default 'swiss')\n";
    print "  -o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "  -h           -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";

    exit 1;
}

You can use it like this:

cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg
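The option handling of the script above can be tried on its own, without BioPerl. The following is a minimal sketch that assumes the same flags (-h, -i, -o) and simulates a command line by assigning to @ARGV; the simulated arguments are made up for the illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line (assumption: same flags as the script above).
local @ARGV = ('-i', 'EMBL', 'extra');

my %opts;
getopts('hi:o:', \%opts) or die "bad options\n";

# Defaults are applied before lower-casing, so a missing option is harmless.
my $inform  = lc($opts{'i'} || 'swiss');
my $outform = lc($opts{'o'} || 'fasta');

# getopts removes the parsed options and leaves the other arguments in @ARGV.
print "in=$inform out=$outform rest=@ARGV\n";
```

A colon after a letter in the getopts specification means that the option takes a value; -h has no colon and is therefore a plain switch.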

Solution A.3. Display a sequence in fasta format

Exercise 2.3

# rest of the annotation
my $annotation = $seq->annotation();

my @structures = ();
foreach my $link ( $annotation->each_DBLink() ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}

Solution A.4. More on annotations

Exercise 2.4

# get the protein function
my $function = "";
foreach my $comment ( $annotation->get_Annotations('comment') ) {
    my @lines = split (/\n/, $comment->text());
    foreach my $line (@lines) {
        if ( $line =~ s/^-!- FUNCTION: // ) {
            $line =~ s/\n/ /g;
            $function = $line;
            last;
        }
    }
    last if $function ne "";
}
print "\nFunction: $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

# almost for GenBank/EMBL entries
my @features = $seq->top_SeqFeatures();

foreach my $feat (@features) {
    if ($feat->primary_tag() eq 'TRANSMEM') {
        print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
    }
}

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

#! /local/bin/perl -w

# Create a new alignment without gaps at the beginning and the end of an alignment

use strict;
use Bio::AlignIO;

# get alignment

my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

my $aln = $in->next_aln();

# find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0;
my $endgaps   = 0;
my $gap_char  = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    # length of the run of gap characters at the start of this sequence
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    # length of the run of gap characters at the end of this sequence
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps

print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
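The gap-trimming step itself does not depend on BioPerl. Below is a minimal sketch of the same idea on plain strings; the helper name gap_margins and the three alignment rows are made up for the illustration, and '-' is assumed to be the gap character.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the widest leading and the widest trailing
# run of gap characters found in any row of the alignment.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $row (@rows) {
        my ($lead)  = $row =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $row =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

# Made-up alignment rows, all 10 columns wide.
my @rows = ('--ACGTAC--', 'AACGGTACG-', '-ACG-TACGT');
my ($startgaps, $endgaps) = gap_margins('-', @rows);

# Keep only the columns that lie after every leading gap run
# and before every trailing gap run.
my $width   = length($rows[0]) - $startgaps - $endgaps;
my @trimmed = map { substr($_, $startgaps, $width) } @rows;

print "$_\n" for @trimmed;
```

Internal gaps (such as the one in the third row) are kept; only the flanking columns are removed, which is what the slice call in the solution does on the alignment object.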

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $query = $Seq_in->next_seq();

# b) else (alternative to a); uncomment to fetch the entry remotely)

# use Bio::DB::SwissProt;
# my $database = new Bio::DB::SwissProt;
# my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    'program'   => 'blastp',
    'database'  => 'swissprot',
    _READMETHOD => "Blast"
);

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

# set the expectation value threshold before running blastall
$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

# keep the raw Blast report in a file
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog'     => 'blastp',
                                                '-data'     => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        }
        else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed the blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;

# despite its name, $max_length is used below as the minimum hit length to report
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

# -1 means "no upper limit" for the end position
my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

# the databank is the prefix of the hit name, before the first '|'
sub dbname {
    my ($name) = @_;
    my ($db) = $name =~ /(\w+)\|/;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

# databanks for which a hit has already been displayed
my %done;

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    }
    else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
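The "first hit per databank" filter can be tested without a Blast report. The sketch below applies the same hash-based bookkeeping to a list of hit names; the names are made up and are assumed to follow the "databank|accession|entry" form, sorted best hit first as in a Blast report.

```perl
use strict;
use warnings;

# Made-up hit names, best hit first (assumption: "databank|accession|entry").
my @hit_names = (
    'sp|X00001|AAAA_TEST',
    'sp|X00002|BBBB_TEST',
    'tr|X00003|CCCC_TEST',
    'sp|X00004|DDDD_TEST',
);

my (%done, @best);
foreach my $name (@hit_names) {
    my ($db) = $name =~ /^(\w+)\|/;
    next unless defined $db;    # skip names without a databank prefix
    next if $done{$db}++;       # a better hit of this databank was already kept
    push @best, $name;
}

print "$_\n" for @best;
```

Because the hits are visited in report order, the first hit seen for each databank is also its best one, so no score comparison is needed.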

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");

# blastall accepts a reference to an array of sequences
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
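The padding done by fill_gaps can also be written with Perl's repetition operator x instead of two loops. This is a minimal sketch with the same contract as fill_gaps; the name fill_gaps_x and the sample values are made up for the illustration.

```perl
use strict;
use warnings;

# Same contract as fill_gaps: $start is the 1-based column where $seq
# begins, $length is the total width of the alignment.
sub fill_gaps_x {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # never pad with a negative count
    return ('-' x $left) . $seq . ('-' x $right);
}

my $padded = fill_gaps_x('ACGT', 3, 10);
print "$padded\n";
```

With a 4-letter sequence starting at column 3 of a 10-column alignment, two gaps are added on the left and four on the right, so that the padded string has the full alignment width.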

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# hits seen in the previous iterations
my @previous_hitarray = ();

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

my $report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some entries

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name );

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    }
    else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
#          + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start  = 0;
    my $end    = 0;
    my $loc;
    my $first  = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds  = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        # shift the exon coordinates so that they are relative to the entry
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }

        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }

        $newloc->add_sub_Location($loc);
    }

    $end = $loc->end + $delta;
    if ($end > $seq->length()) { $end = $seq->length(); }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan +- delta at each side
    my $seqstr;
    if ( $start > $end ) { $seqstr = $seq->subseq($end, $start); }
    else                 { $seqstr = $seq->subseq($start, $end); }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    # local databank name => sequence format of its entries
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );

    @ISA = qw(Bio::DB::RandomAccessI);
}

sub get_Seq_by_id {
    my ($self, @args) = @_;

    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;

    return $self->get_Seq(@args, '-a');
}

sub get_Seq {

    my ($self, $entry, $access) = @_;

    # entries are given as "database:identifier"
    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    # close the pipe, then check the exit status of golden
    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {

    my ($db) = @_;

    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {

    my ($db) = @_;

    return $AVAIL_LOCALDB{$db};
}

1;
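The argument checking of get_Seq can be exercised without BioPerl or the golden program. The following is a minimal sketch that assumes the same "database:identifier" convention and the same databank table; the helper name parse_entry is made up for the illustration, and die stands in for $self->throw.

```perl
use strict;
use warnings;

# Same databank table as in the module above.
my %AVAIL_LOCALDB = ( sprot => 'swiss', sptrnrdb => 'swiss',
                      gb => 'genbank', embl => 'embl' );

# Hypothetical helper: validates "db:id" and returns (db, id, format);
# dies with the same messages that get_Seq throws.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry, 2);
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = parse_entry('sprot:TAUD_ECOLI');
print "$db $id $format\n";

# eval traps the exception, as a caller would trap a thrown bioperl exception.
my $ok = eval { parse_entry('nodb:ABC'); 1 };
print "rejected: $@" unless $ok;
```

Checking the databank name before opening the pipe avoids starting an external process for a request that cannot succeed.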

Page 69: Bio Perl

Appendix A Solutions

-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

cat seqssp | perl solutionUnivConvertStdminpl swiss gcg

UnivConvert with options

localbinperl -w

exercice 11 UnivConvert avec options

use strictuse GetoptStd

use BioSeqIO

my opts = ()getopts(rsquoioIOrsquo opts)

my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo

my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )

print $out $_ while lt$ingt

You can use it like this

perl solutionUnivConvertpl -I seqssp -O seqsfasta

UnivConvert with options and stream

localbinperl -w

62

Appendix A Solutions

exercise 11 UnivConvert with option and stream

use strictuse GetoptStd

use BioSeqIO

my opts = ()getopts(rsquohiorsquo opts)

usage() if (exists $optsrsquohrsquo)

my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo

(format_ok($inform) ampamp format_ok($outform)) || exit 1

my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )

print $out $_ while lt$ingt

sub format_ok

my $form = shift _

my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)

if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false

return 1 true

sub usage

my $prog =lsquobasename $0lsquo chomp($prog)

print nprint $prog convert files from one format into another formatnprint n

63

Appendix A Solutions

print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn

print n

exit 1

You can use it like this

cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg

Solution A3 Display a sequence in fasta format

Exercise 23

rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()

my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )

if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())

Solution A4 More on annotations

Exercise 24

get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )

my lines = split (n $comment-gttext())foreach my $line (lines)

if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line

64

Appendix A Solutions

last

last if $function ne

print nFunction $function n

Solution A5 Transmembran helices

Exercise 25

almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()

foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)

print TM Segment from $feat-gtstart() to $feat-gtend() n

A2 Alignments

Solution A6 Create an alignment without gaps

Exercise 31

localbinperl -w

Create a new alignment without gaps at the begining and the end of an alignment

use strictuse BioAlignIO

get alignment

my $inif ( $ARGV gt -1 )

$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else

$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )

65

Appendix A Solutions

my $aln = $in-gtnext_aln()

find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)

my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len

cut the starting and ending block containing gaps

print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else
    #
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;

    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # return the databank prefix of a hit name (the part before the first '|')
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    # databanks for which a hit has already been printed
    my %done;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
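
The "first hit per databank" filter does not depend on Bio::SearchIO and can be checked on a plain list of names. The following is a minimal sketch under that assumption: the hit names are invented, and best_by_databank is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Return the databank prefix of a hit name such as "sp|X00001|AAAA_ECOLI",
# or an empty string when the name has no "prefix|" part.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : '';
}

# Keep only the first name seen for each databank. Blast reports list hits
# best first, so the first name per databank is also the best one.
sub best_by_databank {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Invented hit names, in report order.
my @hits = (
    'sp|X00001|AAAA_ECOLI',
    'sp|X00002|BBBB_HUMAN',
    'pir|Y00001|Y00001',
    'sp|X00003|CCCC_MOUSE',
    'pir|Y00002|Y00002',
);
print "$_\n" foreach best_by_databank(@hits);
```

Counting with a hash keeps the filter to a single pass over the hits, which is the same idea as the %done table in the solution.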

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
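
The padding done by fill_gaps is easy to check in isolation. The sketch below restates it with Perl's repetition operator (x) instead of two loops; the contig length and the sample sequence are made-up values, and clamping the right padding at zero is an added safeguard, not something the course solution does.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at 1-based column $start and the
# padded string is $length characters long.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # sequence runs past the end: no right padding
    return ('-' x $left) . $seq . ('-' x $right);
}

my $contig_length = 20;          # assumed contig length for the demonstration
my $padded = fill_gaps('ACGTAC', 5, $contig_length);
print "$padded\n";
print "padded length: ", length($padded), "\n";
```

Every padded row then has the same length as the contig row, which is what allows the rows to be added to one alignment.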

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPlite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    # names of all hits seen in the previous iterations
    my @previous_hitarray;

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
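
The comparison between iterations can also be expressed with hash lookups rather than grep with a regular expression. The sketch below is one possible variant, not the course's method: it runs on invented hit names instead of a real PSI-BLAST report, compares each round only with the round just before it, and uses a hypothetical helper named disappeared.

```perl
use strict;
use warnings;

# Hit names per iteration; invented data standing in for a PSI-BLAST report.
my @rounds = (
    [qw(HIT_A HIT_B)],
    [qw(HIT_A HIT_C)],
    [qw(HIT_C HIT_D)],
);

# Return the names present in @$previous but absent from @$current.
sub disappeared {
    my ($previous, $current) = @_;
    my %in_current = map { $_ => 1 } @$current;
    return grep { !$in_current{$_} } @$previous;
}

my @previous = ();
for my $i (0 .. $#rounds) {
    my @current = @{ $rounds[$i] };
    my %seen = map { $_ => 1 } @previous;
    print "iteration: ", $i + 1, "\n";
    print "  NEW hit name: $_\n" foreach grep { !$seen{$_} } @current;
    print "  DISAPPEARED hit name: $_\n" foreach disappeared(\@previous, \@current);
    @previous = @current;
}
```

Exact hash lookups avoid the partial matches that a regular expression can produce when one hit name is contained in another.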


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length);
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTRs as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databank name => SeqIO format of its entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );

        @ISA = qw(Bio::DB::RandomAccessI);
    }

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available.");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db."); }

        $self->throw ("$id -- no such entry in database $db.") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    # a module file must end with a true value
    1;
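
The entry checks performed by get_Seq (split on ':', reject a missing database, look up the format) can be exercised without golden or BioPerl installed. The sketch below is an illustration only: parse_entry is a hypothetical helper, it raises errors with die rather than with the throw method of a bioperl object, and the entry names are sample values.

```perl
use strict;
use warnings;

# Databank name => SeqIO format, copied from the table used by the module.
my %AVAIL_LOCALDB = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Hypothetical helper: split "db:id", validate both parts, and return the
# database name, the entry id and the SeqIO format to use.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n"
        unless defined $id && length $id;
    die "$db -- no such database available.\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

# One valid entry, one without a database, one with an unknown database.
foreach my $entry ('sprot:TAUD_ECOLI', 'TAUD_ECOLI', 'nodb:SOME_ID') {
    my @parsed = eval { parse_entry($entry) };
    if ($@) { print "error: $@"; }
    else    { print join("\t", @parsed), "\n"; }
}
```

Validating the entry before building the golden command line means a malformed name is reported directly instead of surfacing as an empty sequence stream.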


print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n

Solution A23 Create a BioToolsBPlite from a file

Exercise 418

use BioToolsBPlite

my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

76

Appendix A Solutions

Solution A24 Create a BioToolsBPlite from standard input

Exercise 419

use BioToolsBPlite

my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

Solution A25 Parse a PSI-blast report

Exercise 420

use BioToolsBPpsilite

my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])

for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )

print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)

if (grep Q$hitE $oldhitarray_ref ) next

print DISAPEARED hit name $hitn

push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )

77

Appendix A Solutions

Solution A26 Build a BioSimpleAlign object

Exercise 421

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )

my $query1 = lt$query1_ingt

my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )

my $query2 = lt$query2_ingt

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )

$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)

$report = $factory-gtbl2seq($query1 $query2)

while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength

)my $sbjctSeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)

$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln

A4 Databases

78

Appendix A Solutions

Solution A27

Exercise 51

bull (a) database construction

localbinperl -w

(a) Genscan build a database containing one entry per CDS

use strict

use BioToolsGenscanuse BioSeqIO

my $genscan_file = shift ARGV

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

date of constructionmy $date = gmtime

while(my $gene = $genscan-gtnext_prediction())

create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)

extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)

annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())

$entry-gtadd_SeqFeature($exon)

79

Appendix A Solutions

add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())

$entry-gtadd_SeqFeature($utr)

print the entry to the database flatfile in embl formatprint $banque $entry

You can use this script like this

perl solutionbuild_dbpl datagenscan-arabout gt mydbembl

bull (b) make index

usrbinperl -w

(b) Genscan create the index

use strictuse BioIndexEMBL

my $Index_File_Name = shift

my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)

$inx-gtmake_index(ARGV)

You can use this script like this

perl solutionmake_indexpl mydbidx mydbembl

80

Appendix A Solutions

bull (c) use index

usrbinperl -w

(c) Genscan retrieve some files

use strictuse BioIndexEMBLuse BioSeqIO

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)

my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)

foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen

You can use this script like this

perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3

since entries names follows the pattern GENSCAN_predicted_CDS_xxx

Solution A28 Parse a Genscan report and build a database entry with the genomicsequence

Exercise 52

localbinperl -w

Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon

use strict

use BioToolsGenscanuse BioSeqIO

81

Appendix A Solutions

my $genscan_file = shift ARGV

my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)

sequence submitted to genscanmy $seq = $seqin-gtnext_seq()

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime

print $banque $seq

while(my $gene = $genscan-gtnext_prediction())

create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)

extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())

annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())

$loc = $exon-gtlocation()

if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0

$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)

82

Appendix A Solutions

$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)

add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)

construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)

$newloc-gtwarn(exons are coded on different strands)$nocds = 1

if ($loc-gtstrand == -1)

$loc-gtstrand(1)$newloc-gtstrand(-1)

$newloc-gtadd_sub_Location($loc)

$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()

if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo

-location =gt $newloc)

add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())

add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)

construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)

my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)

$entry-gtprimary_seq($newseq)

print the entry to the database flatfile in embl formatprint $banque $entry

83

Appendix A Solutions

Solution A29 Build a small bioperl module (for the golden program)

Exercise 53

Build a small bioperl module

package LocalDBAccess

use strictuse vars qw(ISA AVAIL_LOCALDB)

use BioDBRandomAccessIuse BioSeqIOuse BioSeq

BEGIN

AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )

ISA = qw(BioDBRandomAccessI)

sub get_Seq_by_idmy ($self args) = _

return $self-gtget_Seq( args rsquo-irsquo)

sub get_Seq_by_accmy ($self args) = _

return $self-gtget_Seq(args rsquo-arsquo)

sub get_Seq

my ($self $entry $access) = _

84

Appendix A Solutions

my ($db $id) = split(rsquorsquo $entry)

if ($access ~ ^-[ia]$) $access =

if ($id) $self-gtthrow (no database specified)

if (_has_local_DB($db))

$self-gtthrow ($db -- no such database available)

my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))

my $seq = $in-gtnext_seq()

$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)

$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)

return $seq

sub _has_local_DB

my ($db) = _

if (exists $AVAIL_LOCALDB$db) return 1 return 0

sub _get_seq_format

my ($db) = _

return $AVAIL_LOCALDB$db

85

Appendix A Solutions

86

  • Bioperl course
    • Chapter 1 General introduction
      • 11 Bioperl documentation
      • 12 General bioperl classes
        • Chapter 2 Sequences
          • 21 The BioSeqIO class
          • 22 Sequence classes
          • 23 Features and Location classes
          • 24 Sequence analysis tools
            • Chapter 3 Alignments
              • 31 AlignIO
              • 32 SimpleAlign
              • 33 Code reading protal2dna
                • Chapter 4 Analysis
                  • 41 Blast
                  • 42 Genscan
                    • Chapter 5 Databases
                      • 51 Database classes
                      • 52 Accessing a local database with golden
                        • Chapter 6 Perl Reminders
                          • 61 UML
                          • 62 Perl reminders to use bioperl modules
                          • 63 Perl reminders for a further advanced understanding of bioperl modules
                            • Appendix A Solutions
                              • A1 Sequences
                              • A2 Alignments
                              • A3 Analysis
                              • A4 Databases
Appendix A Solutions

    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "-i informat -- infile format (default 'swiss')\n";
    print "-o outformat -- outfile format (default 'fasta')\n";
    print "\n";
    print "-h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
    print "\n";
    exit 1;

You can use it like this:

    cat seqs.sp | perl solution/UnivConvertStd.pl -o gcg > seqs.gcg

Solution A.3. Display a sequence in fasta format

Exercise 2.3

    # rest of the annotation
    my $annotation = $seq->annotation();

    my @structures = ();
    foreach my $link ( $annotation->each_DBLink() ) {
        if ($link->database() eq 'PDB') {
            push (@structures, $link->primary_id());
        }
    }

Solution A.4. More on annotations

Exercise 2.4

    # get the protein function (exercise)
    my $function = "";
    foreach my $comment ( $annotation->get_Annotations('comment') ) {
        my @lines = split (/\n/, $comment->text());
        foreach my $line (@lines) {
            if ( $line =~ s/^-!- FUNCTION: // ) {
                $line =~ s/\n/ /g;
                $function = $line;
                last;
            }
        }
        last if $function ne "";
    }
    print "\nFunction: $function \n";
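The line-matching step in this solution can be tried without bioperl. The sketch below is a minimal illustration, assuming the comment text uses the Swiss-Prot "-!- TOPIC:" layout; `function_of` and the sample comment are hypothetical names and data, not part of the course material.

```perl
use strict;
use warnings;

# Return the text of the first "-!- FUNCTION:" line in a comment block.
# The "-!- TOPIC:" layout is an assumption based on Swiss-Prot CC lines.
sub function_of {
    my ($text) = @_;
    foreach my $line (split /\n/, $text) {
        # s/// returns true only when the prefix was found and removed
        return $line if $line =~ s/^-!- FUNCTION:\s*//;
    }
    return '';
}

# Made-up comment text standing in for $comment->text()
my $comment = "-!- FUNCTION: Made-up example function.\n-!- SUBUNIT: Homodimer.";
print "Function: ", function_of($comment), "\n";
```

Returning as soon as the substitution succeeds plays the role of the two `last` statements in the solution.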

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost as for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #!/local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end
    # of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment
    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index
    # of ending gaps
    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps
    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
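The gap-counting step can be checked on plain strings before involving `Bio::AlignIO`. This is only a sketch: `gap_margins` is a hypothetical helper and the three rows are made-up data, with `substr` standing in for the `slice` call of the solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): given aligned rows as plain
# strings, return the widest leading and the widest trailing gap run.
sub gap_margins {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

# Made-up alignment rows, all of the same length
my @rows = ('--ACGTAC-', '-TACGTACG', 'ATACGT---');
my ($s, $e) = gap_margins('-', @rows);

# Keep columns $s+1 .. length-$e, the same range passed to slice() above
my @trimmed = map { substr($_, $s, length($_) - $s - $e) } @rows;
print join("\n", @trimmed), "\n";
```

Unlike the solution, the sketch uses the gap character in both patterns, so it would also work when the gap symbol is not "-".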

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)
    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |',
                                  -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use this instead of a)
    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog' => 'blastp',
        '-data' => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting...\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh' => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(),
                  " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank prefix of a hit name: the word before the first "|"
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;    # databanks for which a hit was already displayed
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(),
                      " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
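The "first hit per databank" bookkeeping does not depend on bioperl and can be sketched on a plain list of names. The hit names below are made-up illustrations of a "db|accession|name" layout, and `best_by_databank` is a hypothetical helper; the pattern is anchored at the start of the name, which is slightly stricter than the unanchored match in the solution.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|P00001|HIT_A".
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Return the first (best-ranked) hit name seen for each databank,
# assuming the list is already ordered from best to worst hit.
sub best_by_databank {
    my @names = @_;
    my (%done, @best);
    foreach my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;    # skip names without a databank prefix
        next if $done{$db}++;       # databank already reported
        push @best, $name;
    }
    return @best;
}

# Made-up hit names, not real BLAST output
my @hits = ('sp|P00001|HIT_A', 'sp|P00002|HIT_B', 'pdb|1ABC|A', 'sp|P00003|HIT_C');
print "$_\n" for best_by_databank(@hits);
```

The post-increment on `%done` tests and records a databank in a single step, which replaces the if/else pair of the solution.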

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        # display only the first hit of each query
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(),
                  " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq => $seq->seq,
        -id => $seq->display_id,
        -start => 1,
        -end => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq => $seq,
                -id => $name,
                -start => $start,
                -end => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
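The padding done by fill_gaps can also be written with Perl's repetition operator. The following is a sketch rather than the course's code: `fill_gaps_x` is a hypothetical name, and clamping the right padding at zero is an added assumption for HSP strings that would run past the contig end.

```perl
use strict;
use warnings;

# Same contract as fill_gaps in the solution: place $seq at 1-based
# column $start inside a row of $length columns, padding with '-'.
sub fill_gaps_x {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;    # assumption: never truncate the HSP string
    return ('-' x $left) . $seq . ('-' x $right);
}

# A 4-letter fragment placed at column 3 of a 10-column row
print fill_gaps_x('ACGT', 3, 10), "\n";
```

With these arguments the row is two gaps, the fragment, then four gaps, which is the same string the loop version builds.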

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(
            database => $db,
            program => 'blastp');

        my $report = $factory->blastall($query);

        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            # only the first subject of each database is recorded
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program' => 'blastp',
        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    my @previous_hitarray;    # hits seen in the rounds already processed
    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
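The "disappeared hit" test is a set difference between the hits of earlier rounds and those of the current round. A minimal sketch on plain name lists follows; `disappeared` and the hit names are hypothetical, and it compares names exactly through a hash, whereas the solution uses a pattern match with grep, which also accepts substrings.

```perl
use strict;
use warnings;

# Names present in an earlier round but absent from the current one.
# Plain lists of names stand in for the BPpsilite newhits/oldhits arrays.
sub disappeared {
    my ($previous, $current) = @_;    # two array references
    my %in_current = map { $_ => 1 } @$current;
    my %seen;                         # report each lost name only once
    return grep { !$in_current{$_} && !$seen{$_}++ } @$previous;
}

my @round1 = qw(HIT_A HIT_B HIT_C);   # made-up names
my @round2 = qw(HIT_A HIT_C HIT_D);
print "DISAPPEARED hit name: $_\n" for disappeared(\@round1, \@round2);
```

The `%seen` hash matters once names accumulate over several rounds, because the same hit can then appear more than once in the list of previous hits.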

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
    my $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq => $hsp->qs,
            -id => $query1->display_id,
            -start => 1,
            -end => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq => $hsp->ss,
            -id => $report->sbjctName,
            -start => 1,
            -end => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this database\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
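The entry names used for retrieval come from the middle field of the identifier attached to each predicted CDS. A small sketch of that extraction, assuming an identifier made of three fields separated by "|"; `middle_id` and the sample identifier are made up for illustration.

```perl
use strict;
use warnings;

# Pull the middle field out of a Genscan-style identifier.
# The "a|b|c" layout is assumed from the split and match used
# in the solutions of this section.
sub middle_id {
    my ($full_id) = @_;
    my @fields = split /\|/, $full_id;
    return $fields[1];
}

# Made-up identifier, not taken from a real Genscan report
print middle_id('contig1|GENSCAN_predicted_CDS_3|1200_bp'), "\n";
```

Splitting on "|" avoids relying on the capture variable `$1`, which keeps its previous value when a match fails.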

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database
    #  - 1 entry per gene with genomic DNA
    #  + 1 feature for the whole CDS
    #  + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates relative to the start of the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

83

Appendix A Solutions

Solution A29 Build a small bioperl module (for the golden program)

Exercise 53

Build a small bioperl module

package LocalDBAccess

use strictuse vars qw(ISA AVAIL_LOCALDB)

use BioDBRandomAccessIuse BioSeqIOuse BioSeq

BEGIN

AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )

ISA = qw(BioDBRandomAccessI)

sub get_Seq_by_idmy ($self args) = _

return $self-gtget_Seq( args rsquo-irsquo)

sub get_Seq_by_accmy ($self args) = _

return $self-gtget_Seq(args rsquo-arsquo)

sub get_Seq

my ($self $entry $access) = _

84

Appendix A Solutions

my ($db $id) = split(rsquorsquo $entry)

if ($access ~ ^-[ia]$) $access =

if ($id) $self-gtthrow (no database specified)

if (_has_local_DB($db))

$self-gtthrow ($db -- no such database available)

my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))

my $seq = $in-gtnext_seq()

$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)

$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)

return $seq

sub _has_local_DB

my ($db) = _

if (exists $AVAIL_LOCALDB$db) return 1 return 0

sub _get_seq_format

my ($db) = _

return $AVAIL_LOCALDB$db

85

Appendix A Solutions

86

Page 72: Bio Perl

Appendix A Solutions

last;

last if $function ne "";

print "\nFunction $function \n";

Solution A.5. Transmembrane helices

Exercise 2.5

    # almost for GenBank/EMBL entries
    my @features = $seq->top_SeqFeatures();

    foreach my $feat (@features) {
        if ($feat->primary_tag() eq 'TRANSMEM') {
            print "TM Segment from ", $feat->start(), " to ", $feat->end(), "\n";
        }
    }

A.2. Alignments

Solution A.6. Create an alignment without gaps

Exercise 3.1

    #! /local/bin/perl -w

    # Create a new alignment without gaps at the beginning and the end of an alignment

    use strict;
    use Bio::AlignIO;

    # get alignment

    my $in;
    if ( $#ARGV > -1 ) {
        $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
    } else {
        $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
    }
    my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );

    my $aln = $in->next_aln();

    # find max column index of starting gaps and min column index of ending gaps

    my $startgaps = 0;
    my $endgaps = 0;
    my $gap_char = $aln->gap_char();
    foreach my $seq ($aln->each_seq) {
        my $str = $seq->seq();
        $str =~ /^(\Q$gap_char\E*)/;
        my $len = length($1);
        $startgaps = ($startgaps > $len) ? $startgaps : $len;
        $str =~ /(-*)$/;
        $len = length($1);
        $endgaps = ($endgaps > $len) ? $endgaps : $len;
    }

    # cut the starting and ending block containing gaps

    print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
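The gap-counting step of this solution can be tried on plain strings, without Bioperl. The following is only a sketch under assumptions: the helper name edge_gaps and the three sample rows are made up for illustration, and the gap character is assumed to be '-'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): returns the widest run of
# leading gaps and the widest run of trailing gaps over all rows.
sub edge_gaps {
    my ($gap_char, @rows) = @_;
    my ($startgaps, $endgaps) = (0, 0);
    foreach my $str (@rows) {
        my ($lead)  = $str =~ /^(\Q$gap_char\E*)/;
        my ($trail) = $str =~ /(\Q$gap_char\E*)$/;
        $startgaps = length($lead)  if length($lead)  > $startgaps;
        $endgaps   = length($trail) if length($trail) > $endgaps;
    }
    return ($startgaps, $endgaps);
}

# Made-up alignment rows, all of length 10.
my @rows = ('--ACGTAC--', '-AACGTACG-', 'TTACGTAC--');
my ($startgaps, $endgaps) = edge_gaps('-', @rows);

# Columns kept by the slice, using the same 1-based arithmetic as the solution.
my $first = $startgaps + 1;
my $last  = length($rows[0]) - $endgaps;
print "keep columns $first to $last\n";
print substr($_, $first - 1, $last - $first + 1), "\n" foreach @rows;
```

Taking the maximum on both sides means that a column is dropped as soon as one sequence has a gap there, so some residues of the other sequences are cut as well.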

A.3. Analysis

Solution A.7. Running Blast on a Swissprot entry

Exercise 4.1

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # a) solution if you have a local program to fetch database entries
    #    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden)

    my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
    my $query = $Seq_in->next_seq();

    # b) else (use these lines instead of the two lines above)

    # use Bio::DB::SwissProt;
    # my $database = new Bio::DB::SwissProt;
    # my $query = $database->get_Seq_by_id('TAUD_ECOLI');

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    $factory->e(10);
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.9. Running Blast: Saving output

Exercise 4.3

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );
    $factory->outfile('blast.out');
    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    }

Solution A.10. Running a remote Blast

Exercise 4.4

    use Bio::SeqIO;
    use Bio::Tools::Run::RemoteBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::RemoteBlast->new(
        '-prog'     => 'blastp',
        '-data'     => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->submit_blast($query);

    while ( my @rids = $factory->each_rid ) {
        print STDERR "waiting\n";
        # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    # retrieve_blast returns -1 on error
                    $factory->remove_rid($rid);
                }
                # retrieve_blast returns 0 on 'job not finished'
                sleep 5;
            } else {
                # ---- Blast done ----
                $factory->remove_rid($rid);
                my $result = $rc->next_result;
                print "database: ", $result->database_name(), "\n";
                while( my $hit = $result->next_hit ) {
                    print "hit name is: ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "score is: ", $hsp->score, "\n";
                    }
                }
            }
        }
    }

Solution A.11. Display Blast hits

Exercise 4.5

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit accession: ", $hit->accession(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "length: ", $hsp->length(), "\n";
        }
    }

Solution A.12. Class of a Blast report

Exercise 4.6

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh'     => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
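The position test used in this solution can be isolated and checked on plain numbers. The sketch below is an illustration only: in_window is a hypothetical helper name and the HSP coordinates are made up; it keeps the solution's convention that -1 means "no upper limit".

```perl
use strict;
use warnings;

# True (1) when an HSP lies inside the requested window; -1 means
# "no upper limit", the same convention as $max_end in the solution.
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    return $start >= $min_start && ($max_end == -1 || $end <= $max_end) ? 1 : 0;
}

# Made-up HSP coordinates: [start, end].
my @hsps = ([5, 80], [40, 120], [150, 300]);
foreach my $hsp (@hsps) {
    my ($s, $e) = @$hsp;
    printf "%d..%d bounded:%d open-ended:%d\n",
        $s, $e, in_window($s, $e, 30, 200), in_window($s, $e, 30, -1);
}
```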

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank name = first word followed by '|' in the hit name
    sub dbname {
        my ($name) = @_;
        my $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    my %done;
    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }
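The "first hit per databank" logic relies on a hash of databank names already seen. A minimal sketch on plain strings follows; the hit names are invented, the list is assumed to be ordered best hit first as in a Blast report, and the regular expression is anchored at the start of the name, which is slightly stricter than in the solution.

```perl
use strict;
use warnings;

# Databank prefix of a hit name such as "sp|...": the word before the first '|'.
sub dbname {
    my ($name) = @_;
    return $name =~ /^(\w+)\|/ ? $1 : undef;
}

# Made-up hit names, assumed to be sorted best hit first.
my @hits = ('sp|Q00001|AAA_HUMAN', 'tr|Q00002|BBB_MOUSE',
            'sp|Q00003|CCC_YEAST', 'tr|Q00004|DDD_RAT');

my %done;
my @best;
foreach my $hit (@hits) {
    my $db = dbname($hit);
    next if !defined $db || $done{$db}++;   # keep only the first hit per databank
    push @best, $hit;
}
print "$_\n" foreach @best;
```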

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1], -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'   => 'blastp',
        'database'  => 'swissprot',
        _READMETHOD => "Blast"
    );

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
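The padding done by fill_gaps can be checked on its own. The following sketch reproduces the same arithmetic with the x repetition operator instead of two loops; the sequence, start position and alignment length are made-up values.

```perl
use strict;
use warnings;

# Same padding logic as fill_gaps in the solution, written with the x operator.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;                       # gap columns before the HSP
    my $right = $length - $left - length($seq);   # gap columns after the HSP
    $right = 0 if $right < 0;
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 4, 12);   # made-up HSP starting at column 4 of 12
print "$row\n";
print length($row), "\n";
```

Every padded row gets the length of the contig, which is what lets the HSP strings be added to the same alignment as the contig sequence.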

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        # only the first subject of each database is recorded
        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    # hits reported in the rounds already processed
    my @previous_hitarray;

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }
        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
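The bookkeeping of hits that disappear from one round to the next can be illustrated without a PSI-BLAST report. This is a sketch under assumptions: the per-round lists are invented, each round is assumed to give its new hits plus the hits it shares with earlier rounds, and an exact hash lookup replaces the grep with a regular expression used in the solution.

```perl
use strict;
use warnings;

# Made-up hit names per PSI-BLAST round; each round lists its new hits and
# the hits it still shares with the previous round.
my @rounds = (
    { new => ['hitA', 'hitB', 'hitC'], old => [] },
    { new => ['hitD'],                 old => ['hitA', 'hitC'] },
);

my %seen;          # every hit reported in an earlier round
my @disappeared;   # hits seen before but absent from a later round
foreach my $round (@rounds) {
    my %old = map { $_ => 1 } @{ $round->{old} };
    foreach my $hit (sort keys %seen) {
        push @disappeared, $hit unless $old{$hit};
    }
    $seen{$_} = 1 foreach @{ $round->{new} }, @{ $round->{old} };
}
print "disappeared: @disappeared\n";
```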

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file => $ARGV[0], -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file => $ARGV[1], -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    my $report = $factory->bl2seq($query1, $query2);

    # one alignment per HSP
    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);
        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
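The entry name used as index key comes from the middle field of the Genscan CDS id. A small sketch of that extraction follows; the sample id is made up, and its three-field shape separated by '|' is an assumption consistent with the regular expression in (a).

```perl
use strict;
use warnings;

# Assumed shape of a Genscan CDS id: three '|'-separated fields with the
# entry name in the middle. The sample value is made up.
my $raw_id = 'MYSEQ|GENSCAN_predicted_CDS_3|1234_bp';

my ($by_regex) = $raw_id =~ /\|(.+)\|/;      # text between first and last '|'
my @ids        = split /\|/, $raw_id;        # same field via split
print "$by_regex\n";
print "$ids[1]\n";
```

Both forms give the same result on a three-field id; with more than three fields the greedy regular expression would also keep the inner separators, while split would still return only the second field.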

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS
    #          + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (= join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the sequence
        # submitted to genscan +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
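The coordinate arithmetic around $delta is easier to follow on bare numbers. The sketch below is an illustration with made-up exon positions; unlike the script above, it computes the window end on the original genomic coordinates, which is an assumption about the intent rather than a transcription of the solution.

```perl
use strict;
use warnings;

# Made-up exon coordinates on the genomic sequence (1-based, inclusive).
my @exons      = ([350, 420], [600, 710], [900, 980]);
my $seq_length = 1000;
my $delta      = 100;

# Window: delta bases before the first exon and after the last one,
# clipped to the sequence (the solution clips the left side to 0).
my $start = $exons[0][0] - $delta;
$start = 0 if $start < 0;
my $end = $exons[-1][1] + $delta;
$end = $seq_length if $end > $seq_length;

# Exon coordinates relative to the window, as done in the solution with
# $loc->start($loc->start - $start).
my @shifted = map { [ $_->[0] - $start, $_->[1] - $start ] } @exons;
print "window $start..$end\n";
print join(' ', map { "$_->[0]..$_->[1]" } @shifted), "\n";
```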

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # databank name => sequence format of its entries
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        # entries are given as "db:id"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) {
            $self->throw ("no database specified");
        }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        # close the pipe so that the exit status of golden is available in $?
        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    # a module must return a true value
    1;
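The checks that get_Seq performs before calling golden can be exercised without Bioperl or golden installed. In the sketch below, golden_command is a hypothetical helper that only builds the command string; it uses plain die where the module calls throw, and the entry names are examples.

```perl
use strict;
use warnings;

# Same databank table as in the module above.
my %AVAIL_LOCALDB = (sprot => 'swiss', sptrnrdb => 'swiss',
                     gb => 'genbank', embl => 'embl');

# Hypothetical helper: validate a "db:id" entry and build the golden command
# line without running it.
sub golden_command {
    my ($entry, $access) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $AVAIL_LOCALDB{$db};
    $access = '' unless defined $access && $access =~ /^-[ia]$/;
    my $cmd = join ' ', grep { length } 'golden', $access, $entry;
    return ($cmd, $AVAIL_LOCALDB{$db});
}

my ($cmd, $format) = golden_command('sprot:TAUD_ECOLI', '-i');
print "$cmd ($format)\n";
print "rejected: $@" unless eval { golden_command('nodb:X'); 1 };
```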

Page 73: Bio Perl

Appendix A Solutions

my $aln = $in-gtnext_aln()

find max column index of starting gaps and min column index of ending gaps

my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)

my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len

cut the starting and ending block containing gaps

print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)

A3 Analysis

Solution A7 Running Blast on a Swissprot entry

Exercise 41

use BioSeqIOuse BioToolsRunStandAloneBlast

a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)

my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()

b) else

use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)

my $factory = BioToolsRunStandAloneBlast-gtnew(

66

Appendix A Solutions

rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A8 Running Blast Setting parameters

Exercise 42

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A9 Running Blast Saving output

Exercise 43

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo

67

Appendix A Solutions

rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast

)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() significance $hit-gtsignificance() n

Solution A10 Running a remote Blast

Exercise 44

use BioSeqIOuse BioToolsRunRemoteBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtsubmit_blast($query)

while ( my rids = $factory-gteach_rid )

print STDERR waitingn

RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )

my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )

if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)

retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5

else

---- Blast done ----$factory-gtremove_rid($rid)

68

Appendix A Solutions

my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )

print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )

print score is $hsp-gtscore n

Solution A11 Display Blast hits

Exercise 45

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())

print length $hsp-gtlength() n

Solution A12 Class of a Blast report

Exercise 46

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

69

Appendix A Solutions

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n

Solution A13 Parse Blast results on standard input

Exercise 48

use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-fhrsquo =gt STDIN)

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

One way to use this code is by feeding ablastall output directly to the script through a Unix pipe

blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl

Solution A14 Filtering hits by length

Exercise 49

use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())

70

Appendix A Solutions

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A15 Filtering hits by position

Exercise 410

use BioSearchIO

my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A16 Display the best hit by databank

Exercise 411

use BioSearchIO

sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

71

Appendix A Solutions

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result

while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)

next else

$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while (my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}


Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);

my $result = $blast_report->next_result;
my $level  = $ARGV[1];

while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length);

my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    my $i = 0;
    while (my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end);
        $aln->add_seq($query_seq);
    }
}

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
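The padding done by `fill_gaps` can be checked on its own. The sketch below is an equivalent formulation using Perl's repetition operator `x` instead of two loops; the extra guard for a sequence that runs past the alignment end is an addition, not part of the solution above.

```perl
use strict;
use warnings;

# Pad $seq with '-' so that it starts at column $start (1-based) of an
# alignment that is $length columns wide.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - $left - length($seq);
    $right = 0 if $right < 0;             # sequence runs past the end
    return ('-' x $left) . $seq . ('-' x $right);
}

my $row = fill_gaps('ACGT', 3, 10);
print "$row\n";
```

Every row added to the alignment must have the same width as the contig, which is why the HSP string is padded on both sides before it is wrapped in a `Bio::LocatableSeq`.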

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');

    my $report = $factory->blastall($query);

    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
        " Primary tag ", $feat->primary_tag,
        " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');

my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

# hit names seen in the previous rounds
my @previous_hitarray;

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) { next; }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
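The comparison between two rounds amounts to two set differences. The following sketch, in core Perl with hypothetical hit names, uses hashes for the membership test instead of `grep` with a regular expression; it assumes hit names are compared as exact strings, which is a stricter test than the pattern match above.

```perl
use strict;
use warnings;

# Compare the hit names of two consecutive rounds.
# Returns two array refs: names that are new, names that disappeared.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new  = grep { !$in_previous{$_} } @$current;
    my @gone = grep { !$in_current{$_}  } @$previous;
    return (\@new, \@gone);
}

# Hypothetical hit names for two rounds.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_B HIT_C HIT_D);

my ($new, $gone) = compare_rounds(\@round1, \@round2);
print "NEW hit name: $_\n"         for @$new;
print "DISAPPEARED hit name: $_\n" for @$gone;
```

A hash lookup also avoids reporting the same name several times when it appears in more than one earlier round.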


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

$report = $factory->bl2seq($query1, $query2);

while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length
    );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length);

    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;   # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);

$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database - 1 entry per gene with genomic DNA
# + 1 feature for the whole CDS + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::PrimarySeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');

# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while (my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }

        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);

        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry: subsequence of the sequence
    # submitted to genscan +- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }

    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');

    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
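The coordinate arithmetic behind "subsequence +- delta at each side" is easier to follow on plain numbers. The sketch below is a simplified variant in core Perl, not the method of the script above: it assumes exons are given as [start, end] pairs sorted in ascending order on the forward strand, clamps the window to 1-based coordinates, and shifts the exons so that the window's first base becomes position 1. The function name `window_and_shift` is hypothetical.

```perl
use strict;
use warnings;

# Given exon [start, end] pairs in genomic coordinates, compute a window
# that extends $delta bases on each side (clamped to the sequence), and
# return the window plus the exons re-expressed relative to it.
sub window_and_shift {
    my ($exons, $delta, $seq_length) = @_;
    my ($first, $last) = ($exons->[0][0], $exons->[-1][1]);
    my $win_start = $first - $delta;
    $win_start = 1 if $win_start < 1;                   # clamp on the left
    my $win_end = $last + $delta;
    $win_end = $seq_length if $win_end > $seq_length;   # clamp on the right
    my $offset = $win_start - 1;
    my @shifted = map { [ $_->[0] - $offset, $_->[1] - $offset ] } @$exons;
    return ($win_start, $win_end, \@shifted);
}

# Hypothetical exons on a 520 bp sequence.
my ($ws, $we, $shifted) = window_and_shift([[250, 300], [400, 480]], 100, 520);
print "window: $ws..$we\n";
print "exon: $_->[0]..$_->[1]\n" for @$shifted;
```

Shifting the feature locations by the same offset as the extracted subsequence keeps each exon pointing at the same bases in the new, shorter entry.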


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (! $id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));

    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

# a module must return a true value
1;
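The module reads the output of `golden` through a pipe and then tests `$? >> 8`. The sketch below shows that mechanism in isolation with core Perl only. It assumes a Unix-like system; the helper `run_piped` is hypothetical, and short Perl one-liners stand in for the `golden` program so that the example does not depend on it.

```perl
use strict;
use warnings;

# Run a command through a read pipe and return its output lines together
# with its exit code. close() waits for the child and sets $?.
sub run_piped {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "cannot start @cmd: $!";
    my @lines = <$fh>;
    close($fh);                 # returns false if the command failed
    return (\@lines, $? >> 8);  # high byte of $? is the exit code
}

# $^X is the running perl; the one-liners stand in for "golden".
my ($out, $code)  = run_piped($^X, '-e', 'print "entry\n"; exit 0');
my ($none, $fail) = run_piped($^X, '-e', 'exit 3');
print "ok: code=$code lines=", scalar(@$out), "\n";
print "failed: code=$fail\n";
```

`$?` is only meaningful once the pipe has been closed, which is why the module destroys the `Bio::SeqIO` object before testing it.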

Page 74: Bio Perl

Appendix A Solutions

                                                    'program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.8. Running Blast: Setting parameters

Exercise 4.2

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

$factory->e(10);
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.9. Running Blast: Saving output

Exercise 4.3

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");

my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            #---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while ( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while ( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}
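The polling loop relies on a three-way return convention: -1 on error, 0 while the job is running, and a reference once the result is ready. The sketch below reproduces that control flow in core Perl against a mock queue; the `retrieve` function, the job names and the `%polls_left` table are hypothetical stand-ins, not part of the bioperl API.

```perl
use strict;
use warnings;

# Mock of a remote job queue: each job needs a number of polls before it
# is done; a job set to -1 fails. (Hypothetical stand-in for a service.)
my %polls_left = (job1 => 2, job2 => 0, job3 => -1);

# Mimics the three-way return convention: -1 on error, 0 when the job is
# not finished, and a reference (here a hash ref) when the result is ready.
sub retrieve {
    my ($id) = @_;
    return -1 if $polls_left{$id} < 0;
    if ($polls_left{$id} > 0) { $polls_left{$id}--; return 0; }
    return { id => $id };
}

my %pending = map { $_ => 1 } keys %polls_left;
my (@done, @failed);
while (my @ids = sort keys %pending) {
    for my $id (@ids) {
        my $rc = retrieve($id);
        if (!ref $rc) {
            if ($rc < 0) {              # error: give up on this job
                delete $pending{$id};
                push @failed, $id;
            }
            # otherwise still running; a real client would sleep here
        } else {                        # finished: collect and forget
            delete $pending{$id};
            push @done, $rc->{id};
        }
    }
}
print "done: @done\nfailed: @failed\n";
```

Testing `ref` first separates the two numeric status codes from a real result object, so a finished job can never be mistaken for a status value.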

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');

my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");

my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);

my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while (my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Page 75: Bio Perl

Appendix A Solutions

'database' => 'swissprot',
_READMETHOD => "Blast"
);
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}

Solution A.10. Running a remote Blast

Exercise 4.4

use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
                                                '-data' => 'swissprot',
                                                _READMETHOD => "Blast");
my $blast_report = $factory->submit_blast($query);

while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n";
    # RID = Remote Blast ID (e.g. 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on 'job not finished'
            sleep 5;
        } else {
            # ---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}

Solution A.11. Display Blast hits

Exercise 4.5

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit accession: ", $hit->accession(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "length: ", $hsp->length(), "\n";
    }
}

Solution A.12. Class of a Blast report

Exercise 4.6

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

use Bio::SearchIO;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-fh'     => \*STDIN);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl
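The solution above hands a reference to the STDIN glob to the parser. As a minimal sketch in plain Perl (no BioPerl needed), the same idea of passing a filehandle to a routine looks as follows. The function name `count_lines` and the sample text are hypothetical, and an in-memory handle stands in for piped input.

```perl
use strict;
use warnings;

# Count the lines available on any filehandle: \*STDIN, a lexical
# handle returned by open(), or an in-memory handle.
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    $n++ while <$fh>;
    return $n;
}

# An in-memory handle stands in for piped input in this sketch.
my $text = "line one\nline two\n";
open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
print count_lines($fh), "\n";
close($fh);
```

Calling `count_lines(\*STDIN)` instead would read whatever is piped into the script, which is the way the `-fh` argument is used in the solution.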

Solution A.14. Filtering hits by length

Exercise 4.9

use Bio::SearchIO;
my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    if ($hit->length() >= $max_length) {
        print "\thit name: ", $hit->name(), " hit length: ", $hit->length(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}

Solution A.15. Filtering hits by position

Exercise 4.10

use Bio::SearchIO;

my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
my $max_end   = ($ARGV[2]) ? $ARGV[2] : -1;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->start() >= $min_start &&
            ($max_end == -1 || ($hsp->end() <= $max_end))) {
            print "\t\tstart: ", $hsp->start(), " end: ", $hsp->end(), "\n";
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
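The position test used in this solution can be isolated into a small helper. The following is a minimal sketch in plain Perl, without BioPerl. The function name `in_window` and the sample coordinates are hypothetical, and -1 serves as the "no upper bound" value, as in the script.

```perl
use strict;
use warnings;

# Return 1 when [$start, $end] lies inside the requested window, else 0.
# A $max_end of -1 means "no upper bound".
sub in_window {
    my ($start, $end, $min_start, $max_end) = @_;
    return 0 if $start < $min_start;
    return 1 if $max_end == -1;
    return $end <= $max_end ? 1 : 0;
}

# Hypothetical HSP coordinates, given as [start, end] pairs.
my @hsps = ([5, 40], [60, 120], [130, 300]);
for my $hsp (@hsps) {
    my ($s, $e) = @$hsp;
    print "start: $s end: $e\n" if in_window($s, $e, 50, 200);
}
```

Keeping the test in one function makes the meaning of the -1 default explicit and lets it be checked without a Blast report at hand.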

Solution A.16. Display the best hit by databank

Exercise 4.11

use Bio::SearchIO;

sub dbname {
    my ($name) = @_;
    $db = $name;
    $db =~ /(\w+)\|/;
    $db = $1;
    return $db;
}

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;

while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }
}
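The "first hit seen per databank" logic of this solution does not depend on BioPerl and can be tried on plain strings. The sketch below assumes hit names of the form `db|accession|name`. The sample names and the helper `best_by_databank` are hypothetical.

```perl
use strict;
use warnings;

# Extract the databank prefix from a hit name such as "sp|P11111|A_HUMAN".
# Returns undef when the name contains no "|" separator.
sub dbname {
    my ($name) = @_;
    return $name =~ /(\w+)\|/ ? $1 : undef;
}

# Keep only the first (best-ranked) hit seen for each databank.
sub best_by_databank {
    my @names = @_;
    my (%done, @best);
    for my $name (@names) {
        my $db = dbname($name);
        next unless defined $db;
        next if $done{$db}++;
        push @best, $name;
    }
    return @best;
}

# Hypothetical hit names, already sorted by significance.
my @hits = ('sp|P11111|A_HUMAN', 'tr|Q22222|B_MOUSE', 'sp|P33333|C_YEAST');
print "$_\n" for best_by_databank(@hits);
```

This relies on the report listing hits from most to least significant, so the first hit met for a databank is also its best one.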

Solution A.17. Multiple queries

Exercise 4.12

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my @query_seqs = ();
my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                               -format => 'fasta');
push (@query_seqs, $Seq1_in->next_seq());

my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                               -format => 'fasta');
push (@query_seqs, $Seq2_in->next_seq());

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall(\@query_seqs);
while (my $result = $blast_report->next_result()) {
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
        last;
    }
}

Solution A.18. Extracting the subject sequence

Exercise 4.13

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.19. Extracting alignments

Exercise 4.14

use Bio::SearchIO;

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[0]);
my $result = $blast_report->next_result;
my $level = $ARGV[1];

while( my $hit = $result->next_hit()) {
    while( my $hsp = $hit->next_hsp()) {
        if ($hsp->frac_identical() >= $level) {
            print $hsp->hit_string, "\n";
        }
    }
}

Solution A.20. Locate EST in a genome

Exercise 4.15

use Bio::SeqIO;
use Bio::SearchIO;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
                    -seq   => $seq->seq,
                    -id    => $seq->display_id,
                    -start => 1,
                    -end   => $seq->length);
my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                      '-file'   => $ARGV[1]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end   = $hsp->hit->end();
        my $name  = $result->query_name . "_HSP_$i";
        my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
                           -seq   => $seq,
                           -id    => $name,
                           -start => $start,
                           -end   => $end);
        $aln->add_seq($query_seq);
    }
}
my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}

Example of use:

perl solution/blast_locate.pl data/contig data/blast-est.out
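The gap padding done by `fill_gaps` can also be written with Perl's repetition operator `x`. The sketch below is an alternative formulation, not the course's own code. It aims to produce the same string as the two loops for in-range input, and clamps negative counts to zero. The sample sequence and coordinates are hypothetical.

```perl
use strict;
use warnings;

# Pad $seq with "-" so that it starts at column $start (1-based)
# and the result is $length characters long when the sequence fits.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $lead = $start - 1;
    $lead = 0 if $lead < 0;
    my $trail = $length - $lead - length($seq);
    $trail = 0 if $trail < 0;    # sequence runs past the alignment end
    return ('-' x $lead) . $seq . ('-' x $trail);
}

print fill_gaps('ACGT', 3, 10), "\n";
```

With a start of 3 and an alignment length of 10, the four-letter sequence is preceded by two gap characters and followed by four.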

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::Seq;
use Bio::SeqFeature::SimilarityPair;

my $seqfile = shift @ARGV;
my $in = Bio::SeqIO->new (-file => $seqfile , -format => 'fasta');
my $query = $in->next_seq();
my @dbs = @ARGV;

foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');
    my $report = $factory->blastall($query);

    while(my $sbjct = $report->nextSbjct) {
        print STDERR "name: ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}

foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                 " Primary tag ", $feat->primary_tag,
                 " produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}

$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();

my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');
my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

use Bio::Tools::BPlite;

my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
while (my $subject = $report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->score, $hsp->length, $hsp->match, $hsp->positive), "\n";
    }
}

Solution A.25. Parse a PSI-blast report

Exercise 4.20

use Bio::Tools::BPpsilite;

my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit ( @$hitarray_ref ) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @$oldhitarray_ref );
    push (@previous_hitarray, @$hitarray_ref );
}
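Reporting which hits are new and which have disappeared between two iterations is a set comparison. The sketch below shows one way to do it in plain Perl with hash lookups. The function name `compare_rounds` and the hit names are hypothetical, and the two lists stand in for what a PSI-BLAST parser would return for successive rounds.

```perl
use strict;
use warnings;

# Compare two lists of hit names and report what appeared and what vanished.
# Returns two array references: (\@new, \@disappeared).
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_previous = map { $_ => 1 } @$previous;
    my %in_current  = map { $_ => 1 } @$current;
    my @new         = grep { !$in_previous{$_} } @$current;
    my @disappeared = grep { !$in_current{$_} }  @$previous;
    return (\@new, \@disappeared);
}

# Hypothetical hit names for two successive PSI-BLAST rounds.
my @round1 = qw(HIT_A HIT_B HIT_C);
my @round2 = qw(HIT_B HIT_C HIT_D);
my ($new, $gone) = compare_rounds(\@round1, \@round2);
print "NEW hit name: $_\n"         for @$new;
print "DISAPPEARED hit name: $_\n" for @$gone;
```

A hash lookup compares whole names, whereas a `grep` with an unanchored pattern also accepts a name that is only a substring of another. Which behaviour is wanted depends on how hit names are written in the report.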

Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
use Bio::AlignIO;
use Bio::SimpleAlign;
use Bio::LocatableSeq;

my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;

my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report = $factory->bl2seq($query1, $query2);

while(my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
                      -seq   => $hsp->qs,
                      -id    => $query1->display_id,
                      -start => 1,
                      -end   => $query1->length
                   );
    my $sbjctSeq = Bio::LocatableSeq->new(
                      -seq   => $hsp->ss,
                      -id    => $report->sbjctName,
                      -start => 1,
                      -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

#! /local/bin/perl -w

# (a) Genscan: build a database containing one entry per CDS

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# date of construction
my $date = gmtime;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('mRNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    $cds->id() =~ /\|(.+)\|/;
    my $id = $1;    # bug
    $cds->id($id);
    # add cds as sequence to the entry
    $entry->primary_seq($cds);

    # annotation: add all detected exons as SeqFeature to the entry
    foreach my $exon ($gene->exons()) {
        $entry->add_SeqFeature($exon);
    }

    # add all detected UTR's as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

You can use this script like this:

perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

#! /usr/bin/perl -w

# (b) Genscan: create the index

use strict;
use Bio::Index::EMBL;

my $Index_File_Name = shift;

my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                 -write_flag => 1);
$inx->make_index(@ARGV);

You can use this script like this:

perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

#! /usr/bin/perl -w

# (c) Genscan: retrieve some files

use strict;
use Bio::Index::EMBL;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => 'fasta');

my $Index_File_Name = shift;
my $inx = Bio::Index::EMBL->new($Index_File_Name);

foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    if (defined $seq) {
        print $out $seq;
    } else {
        print STDERR "no entry with id $id in this banque\n";
    }
}

You can use this script like this:

perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
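Entry names of this form are taken from the identifier of each predicted CDS. The sketch below shows the extraction step alone, in plain Perl. It assumes that the raw identifier is made of fields separated by `|` with the prediction name in second position. That layout, the sample identifier and the function name `prediction_id` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Extract the second field of a "|"-separated identifier.
# Returns undef when the identifier has fewer than two fields.
sub prediction_id {
    my ($raw_id) = @_;
    my @fields = split /\|/, $raw_id;
    return $fields[1];
}

# Hypothetical raw identifier; the exact layout is an assumption.
my $raw = 'myseq|GENSCAN_predicted_CDS_3|1386_bp';
print prediction_id($raw), "\n";
```

The extracted name is what would then be passed on the command line of the retrieval script.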

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

#! /local/bin/perl -w

# Genscan: build a database
#  - 1 entry per gene with genomic DNA
#  + 1 feature for the whole CDS
#  + 1 feature per exon

use strict;

use Bio::Tools::Genscan;
use Bio::SeqIO;

my $genscan_file = shift @ARGV;

my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                              -format => 'fasta');
# sequence submitted to genscan
my $seq = $seqin->next_seq();

# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');

# +- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;

print $banque $seq;

while(my $gene = $genscan->next_prediction()) {

    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');

    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());

    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {

        $loc = $exon->location();

        if ($first) {
            $first = 0;
            $start = $loc->start - $delta;
            if ($start < 0) { $start = 0; }
        }
        $loc->start($loc->start - $start);
        $loc->end  ($loc->end - $start);

        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);

        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);

        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }
    }

    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);
        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());
        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }

    # construct the sequence for the entry:
    # subsequence of the sequence submitted to genscan +- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }
    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1],
                                        -moltype => 'dna');
    $entry->primary_seq($newseq);

    # print the entry to the database flatfile in embl format
    print $banque $entry;
}

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

# Build a small bioperl module

package LocalDBAccess;

use strict;
use vars qw(@ISA %AVAIL_LOCALDB);

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = ( sprot    => 'swiss',
                       sptrnrdb => 'swiss',
                       gb       => 'genbank',
                       embl     => 'embl' );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq( @args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;

    my ($db, $id) = split(':', $entry);

    if ($access !~ /^-[ia]$/) { $access = ""; }

    if (!$id) { $self->throw ("no database specified"); }

    if (! _has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }

    my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                              -format => _get_seq_format($db));
    my $seq = $in->next_seq();

    $in->DESTROY();
    if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) { return 1; }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;  # a module file must end with a true value
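The argument checks performed by `get_Seq` can be tried without the golden program. The sketch below reproduces only the entry-name parsing and the databank lookup in plain Perl. The function name `parse_entry` and the entry names used are hypothetical, and errors are raised with `die` rather than with the BioPerl `throw` method.

```perl
use strict;
use warnings;

# Databank name => sequence format, as listed in the module.
my %AVAIL_LOCALDB = (
    sprot    => 'swiss',
    sptrnrdb => 'swiss',
    gb       => 'genbank',
    embl     => 'embl',
);

# Split a "db:id" entry name and look up the sequence format.
# Dies with a message when the entry is malformed or the databank is unknown.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

# Hypothetical entry name in "db:id" form.
my ($db, $id, $format) = parse_entry('sprot:CYC_HUMAN');
print "db=$db id=$id format=$format\n";
```

Wrapping the call in `eval { ... }` lets a caller recover from a malformed entry name, in the same way that a thrown BioPerl exception can be caught.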

Page 76: Bio Perl

Appendix A Solutions

my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )

print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )

print score is $hsp-gtscore n

Solution A11 Display Blast hits

Exercise 45

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())

print length $hsp-gtlength() n

Solution A12 Class of a Blast report

Exercise 46

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

69

Appendix A Solutions

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo

_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n

Solution A13 Parse Blast results on standard input

Exercise 48

use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-fhrsquo =gt STDIN)

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

One way to use this code is by feeding ablastall output directly to the script through a Unix pipe

blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl

Solution A14 Filtering hits by length

Exercise 49

use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())

70

Appendix A Solutions

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A15 Filtering hits by position

Exercise 410

use BioSearchIO

my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A16 Display the best hit by databank

Exercise 411

use BioSearchIO

sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

71

Appendix A Solutions

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result

while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)

next else

$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A17 Multiple queries

Exercise 412

use BioSeqIOuse BioToolsRunStandAloneBlast

my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]

-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())

my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)

push (query_seqs $Seq2_in-gtnext_seq())

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())

while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast

72

Appendix A Solutions

Solution A18 Extracting the subject sequence

Exercise 413

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A19 Extracting alignments

Exercise 414

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A20 Locate EST in a genome

Exercise 415

use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

73

Appendix A Solutions

my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)

my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

my $i = 0while( my $hsp = $hit-gtnext_hsp())

$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)

$aln-gtadd_seq($query_seq)

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln

sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)

$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)

$gapped_seq = -return $gapped_seq

74

Appendix A Solutions

Example of use

perl solutionblast_locatepl datacontig datablast-estout

Solution A21 Record Blast hits as sequence features

Exercise 416

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair

my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV

foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db

program =gt rsquoblastprsquo)

my $report = $factory-gtblastall($query)

while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)

print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)

last

foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)

$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query

75

Appendix A Solutions

Solution A22 Print informations from a BPLite report

Exercise 417

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)

my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n

Solution A23 Create a BioToolsBPlite from a file

Exercise 418

use BioToolsBPlite

my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

76

Appendix A Solutions

Solution A24 Create a BioToolsBPlite from standard input

Exercise 419

use BioToolsBPlite

my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

Solution A25 Parse a PSI-blast report

Exercise 420

use BioToolsBPpsilite

my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])

for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )

print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)

if (grep Q$hitE $oldhitarray_ref ) next

print DISAPEARED hit name $hitn

push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )

77

Appendix A Solutions

Solution A26 Build a BioSimpleAlign object

Exercise 421

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )

my $query1 = lt$query1_ingt

my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )

my $query2 = lt$query2_ingt

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )

$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)

$report = $factory-gtbl2seq($query1 $query2)

while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength

)my $sbjctSeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)

$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename => $Index_File_Name,
                                     -write_flag => 1);

    # the remaining arguments are the flat files to index
    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    # fetch every id given on the command line and print it as fasta
    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
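The entry names come from the identifier that Genscan attaches to each predicted CDS. As a minimal sketch in plain Perl (no bioperl needed), the helper below pulls the middle field out of a pipe-delimited identifier. The sample identifier and the function name middle_field are assumptions made for illustration; the exact layout of real Genscan identifiers may differ.

```perl
use strict;
use warnings;

# Return the second '|'-separated field of an identifier.
# The three-field layout used here is an assumption for illustration.
sub middle_field {
    my ($raw_id) = @_;
    my @fields = split /\|/, $raw_id;
    return $fields[1];
}

# Hypothetical identifier, not taken from a real Genscan report.
my $short_id = middle_field('contig1|GENSCAN_predicted_CDS_3|1290_bp');
print "$short_id\n";   # prints GENSCAN_predicted_CDS_3
```

The resulting short name is the kind of key one would pass to the index lookup script.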

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT,
                                     -format => 'embl');

    # +- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates so they are relative to the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan +- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq => $seqstr,
                                            -id => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
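The window arithmetic in this solution (extend the gene by delta on each side, then keep the result inside the submitted sequence) can be isolated in a few lines of plain Perl. This is only a sketch under stated assumptions: the function name clamp_window is hypothetical, and it clamps the lower bound to 1 on the assumption that sequence coordinates are 1-based, whereas the solution above clamps to 0.

```perl
use strict;
use warnings;

# Extend [first_start, last_end] by delta on both sides and clamp the
# result to [1, seq_length]. Clamping to 1 assumes 1-based coordinates.
sub clamp_window {
    my ($first_start, $last_end, $delta, $seq_length) = @_;
    my $start = $first_start - $delta;
    $start = 1 if $start < 1;
    my $end = $last_end + $delta;
    $end = $seq_length if $end > $seq_length;
    return ($start, $end);
}

# A gene close to both ends of a 950 bp sequence.
print join('..', clamp_window(50, 900, 100, 950)), "\n";    # prints 1..950
# A gene well inside a 2000 bp sequence.
print join('..', clamp_window(500, 700, 100, 2000)), "\n";  # prints 400..800
```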


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format of its entries
        %AVAIL_LOCALDB = ( sprot => 'swiss',
                           sptrnrdb => 'swiss',
                           gb => 'genbank',
                           embl => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as "database:identifier"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new(
            -file => "golden $access $entry 2> /dev/null |",
            -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db")
            if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    # a module file must end with a true value
    1;

Page 77: Bio Perl

Appendix A Solutions

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall($query);
    print "Class of blast_report is: ", ref($blast_report), "\n";

Solution A.13. Parse Blast results on standard input

Exercise 4.8

    use Bio::SearchIO;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-fh' => \*STDIN);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            print "E: ", $hsp->evalue(), " frac_identical: ",
                  $hsp->frac_identical(), "\n";
        }
    }

One way to use this code is to feed blastall output directly to the script through a Unix pipe:

    blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl

Solution A.14. Filtering hits by length

Exercise 4.9

    use Bio::SearchIO;
    # length threshold: second argument, 200 by default
    my $max_length = ($ARGV[1]) ? $ARGV[1] : 200;
    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);
    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        if ($hit->length() >= $max_length) {
            print "\thit name: ", $hit->name(),
                  " hit length: ", $hit->length(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "E: ", $hsp->evalue(), " frac_identical: ",
                      $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.15. Filtering hits by position

Exercise 4.10

    use Bio::SearchIO;

    # optional bounds: start defaults to 1, end to -1 (no upper bound)
    my $min_start = ($ARGV[1]) ? $ARGV[1] : 1;
    my $max_end = ($ARGV[2]) ? $ARGV[2] : -1;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        print "\thit name: ", $hit->name(), "\n";
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->start() >= $min_start &&
                ($max_end == -1 || ($hsp->end() <= $max_end))) {
                print "\t\tstart: ", $hsp->start(),
                      " end: ", $hsp->end(), "\n";
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ",
                      $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.16. Display the best hit by databank

Exercise 4.11

    use Bio::SearchIO;

    # databank name: the word preceding the first '|' in the hit name
    sub dbname {
        my ($name) = @_;
        $db = $name;
        $db =~ /(\w+)\|/;
        $db = $1;
        return $db;
    }

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            # first (best) hit seen for this databank
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ",
                      $hsp->frac_identical(), "\n";
            }
        }
    }
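The "best hit by databank" logic relies on hits arriving sorted by significance, so the first hit seen for each databank prefix is the best one. The following plain-Perl sketch shows that first-seen filter on a list of names, without bioperl. The function names and the sample hit names are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Databank prefix: the word before the first '|', or undef if absent.
sub db_prefix {
    my ($name) = @_;
    return $name =~ /(\w+)\|/ ? $1 : undef;
}

# Keep only the first name seen for each databank prefix.
sub first_hit_per_db {
    my @names = @_;
    my (%done, @kept);
    for my $name (@names) {
        my $db = db_prefix($name);
        next unless defined $db;   # skip names without a prefix
        next if $done{$db}++;      # already have a hit for this databank
        push @kept, $name;
    }
    return @kept;
}

# Hypothetical hit names, assumed to be sorted best first.
my @best = first_hit_per_db('sp|P1|A', 'sp|P2|B', 'tr|Q1|C');
print join(' ', @best), "\n";   # prints sp|P1|A tr|Q1|C
```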

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot',
                                                        _READMETHOD => "Blast");

    # several queries are passed as a reference to an array of sequences
    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ",
                  $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq => $seq->seq,
        -id => $seq->display_id,
        -start => 1,
        -end => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file' => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end = $hsp->hit->end();
            my $name = $result->query_name . "_HSP_$i";
            my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq => $seq,
                -id => $name,
                -start => $start,
                -end => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
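The gap-filling step in this solution pads an HSP string so that it occupies its position within a row of the full alignment length. As a hedged sketch, the same padding can be written with Perl's repetition operator instead of explicit loops. The function name pad_with_gaps is hypothetical, and the guard against a negative right-hand count is an addition that is not in the solution above.

```perl
use strict;
use warnings;

# Place $seq at 1-based column $start in a row of $length columns,
# filling the remaining columns with '-' gap characters.
sub pad_with_gaps {
    my ($seq, $start, $length) = @_;
    my $left_count  = $start - 1;
    my $right_count = $length - $left_count - length($seq);
    $right_count = 0 if $right_count < 0;   # sequence runs past the end
    return ('-' x $left_count) . $seq . ('-' x $right_count);
}

print pad_with_gaps('ACGT', 3, 10), "\n";   # prints --ACGT----
```

For inputs that fit within the row, the result has exactly $length characters, which is what the alignment object needs for every row.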

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile , -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program => 'blastp');

        my $report = $factory->blastall($query);

        # record the HSPs of the first subject only
        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            " produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                                  -format => 'fasta');
    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that appear for the first time in this round
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits seen in earlier rounds that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
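Finding the hits that disappeared between two PSI-blast rounds is a set difference: names present in the earlier rounds but absent from the current one. A hash lookup expresses this without a nested grep. The sketch below is plain Perl with hypothetical names (disappeared_hits, the sample accession strings) and is an alternative formulation, not the code used in the solution above.

```perl
use strict;
use warnings;

# Return the names in @$previous that do not occur in @$current.
sub disappeared_hits {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$current;
    return grep { !$seen{$_} } @$previous;
}

# Hypothetical hit names from two successive rounds.
my @gone = disappeared_hits(['P1', 'P2', 'P3'], ['P2', 'P4']);
print "@gone\n";   # prints P1 P3
```

The hash makes each membership test a constant-time lookup and compares names exactly rather than by regular expression match.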


Page 78: Bio Perl

Appendix A Solutions

print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A15 Filtering hits by position

Exercise 410

use BioSearchIO

my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A16 Display the best hit by databank

Exercise 411

use BioSearchIO

sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo

71

Appendix A Solutions

rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result

while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)

next else

$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())

print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n

Solution A17 Multiple queries

Exercise 412

use BioSeqIOuse BioToolsRunStandAloneBlast

my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]

-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())

my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)

push (query_seqs $Seq2_in-gtnext_seq())

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)

my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())

while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast

72

Appendix A Solutions

Solution A18 Extracting the subject sequence

Exercise 413

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A19 Extracting alignments

Exercise 414

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A20 Locate EST in a genome

Exercise 415

use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

73

Appendix A Solutions

my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)

my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

my $i = 0while( my $hsp = $hit-gtnext_hsp())

$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)

$aln-gtadd_seq($query_seq)

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln

sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)

$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)

$gapped_seq = -return $gapped_seq

74

Appendix A Solutions

Example of use

perl solutionblast_locatepl datacontig datablast-estout

Solution A21 Record Blast hits as sequence features

Exercise 416

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair

my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV

foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db

program =gt rsquoblastprsquo)

my $report = $factory-gtblastall($query)

while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)

print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)

last

foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)

$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query

75

Appendix A Solutions

Solution A22 Print informations from a BPLite report

Exercise 417

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)

my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n

Solution A23 Create a BioToolsBPlite from a file

Exercise 418

use BioToolsBPlite

my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

76

Appendix A Solutions

Solution A24 Create a BioToolsBPlite from standard input

Exercise 419

use BioToolsBPlite

my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

Solution A25 Parse a PSI-blast report

Exercise 420

use BioToolsBPpsilite

my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])

for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )

print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)

if (grep Q$hitE $oldhitarray_ref ) next

print DISAPEARED hit name $hitn

push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )

77

Appendix A Solutions

Solution A26 Build a BioSimpleAlign object

Exercise 421

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )

my $query1 = lt$query1_ingt

my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )

my $query2 = lt$query2_ingt

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )

$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)

$report = $factory-gtbl2seq($query1 $query2)

while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength

)my $sbjctSeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)

$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln

A4 Databases

78

Appendix A Solutions

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;   # needed for Bio::Seq::RichSeq->new below

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;   # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
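The entry names come from the identifier of each predicted CDS. The following is a minimal sketch of pulling out that middle field in plain Perl, in two equivalent ways. The sample identifier is hypothetical and only assumed to have the shape "source|GENSCAN_predicted_CDS_n|length"; real Genscan reports may differ.

```perl
use strict;
use warnings;

# Hypothetical identifier (an assumption, not taken from a real report).
my $raw_id = "arab_contig|GENSCAN_predicted_CDS_3|1458_bp";

# Variant 1: capture whatever sits between the first and the last '|'.
my ($by_regex) = $raw_id =~ /\|(.+)\|/;

# Variant 2: split on '|' and take the second field.
my @fields   = split /\|/, $raw_id;
my $by_split = $fields[1];

print "$by_regex\n";   # GENSCAN_predicted_CDS_3
print "$by_split\n";   # GENSCAN_predicted_CDS_3
```

Both variants agree when the identifier contains exactly two '|' characters; with more separators the greedy regex would capture several fields at once, so the split form is the more predictable choice.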

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    # classes instantiated directly below
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            # the entry starts delta bases before the first exon
            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift exon coordinates so they are relative to the entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation', $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # database name => sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as "db:id"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw ("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from a pipe opened on the golden program
        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw ("$id -- no such entry in database $db"); }

        $self->throw ("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;   # a module must return a true value
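The argument checks in get_Seq can be tried without golden or bioperl installed. The sketch below reuses the same database table and the "db:id" convention; the helper name parse_entry and the entry sprot:CYC_HUMAN are hypothetical and only serve as illustration.

```perl
use strict;
use warnings;

# Same table as in the module: database name => sequence format.
my %AVAIL_LOCALDB = ( sprot => 'swiss', sptrnrdb => 'swiss',
                      gb    => 'genbank', embl   => 'embl' );

# Hypothetical helper: split a "db:id" entry and look up its format.
# Dies with the same messages the module throws.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split(':', $entry);
    die "no database specified\n" unless $id;
    die "$db -- no such database available\n"
        unless exists $AVAIL_LOCALDB{$db};
    return ($db, $id, $AVAIL_LOCALDB{$db});
}

my ($db, $id, $format) = parse_entry("sprot:CYC_HUMAN");
print "db=$db id=$id format=$format\n";

# A bare identifier without the "db:" prefix is rejected.
my $ok = eval { parse_entry("CYC_HUMAN"); 1 };
print "rejected: $@" unless $ok;
```

Note that a bare identifier ends up in $db, leaving $id undefined, which is why the module reports "no database specified" in that case.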

                                          '-file'   => $ARGV[0]);
    my $result = $blast_report->next_result;

    while( my $hit = $result->next_hit()) {
        my $dbname = dbname($hit->name());
        if ($done{$dbname}) {
            next;
        } else {
            $done{$dbname} = 1;
            print "\thit name: ", $hit->name(), "\n";
            while( my $hsp = $hit->next_hsp()) {
                print "\t\tE: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
            }
        }
    }

Solution A.17. Multiple queries

Exercise 4.12

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # collect the first sequence of each input file
    my @query_seqs = ();
    my $Seq1_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                   -format => 'fasta');
    push (@query_seqs, $Seq1_in->next_seq());

    my $Seq2_in = Bio::SeqIO->new (-file   => $ARGV[1],
                                   -format => 'fasta');
    push (@query_seqs, $Seq2_in->next_seq());

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                        'database'  => 'swissprot',
                                                        _READMETHOD => "Blast");

    my $blast_report = $factory->blastall(\@query_seqs);
    while (my $result = $blast_report->next_result()) {
        # print only the first hit of each query
        while( my $hit = $result->next_hit()) {
            print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
            last;
        }
    }

Solution A.18. Extracting the subject sequence

Exercise 4.13

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    # print the subject string of every HSP above the identity threshold
    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.19. Extracting alignments

Exercise 4.14

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    my $result = $blast_report->next_result;
    my $level = $ARGV[1];

    while( my $hit = $result->next_hit()) {
        while( my $hsp = $hit->next_hsp()) {
            if ($hsp->frac_identical() >= $level) {
                print $hsp->hit_string, "\n";
            }
        }
    }

Solution A.20. Locate EST in a genome

Exercise 4.15

    use Bio::SeqIO;
    use Bio::SearchIO;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    # the contig is the first sequence of the alignment
    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO ('-format' => 'blast',
                                          '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while( my $hit = $result->next_hit()) {
        my $i = 0;
        while( my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
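The gap padding done by fill_gaps can be checked on toy data without bioperl. This is a minimal sketch, not the course's own code: it uses Perl's repetition operator x in place of the two loops, and the reference and fragment strings are made up for illustration.

```perl
use strict;
use warnings;

# Pad an aligned fragment with '-' so that it spans the whole reference.
# $start is the 1-based position of the fragment on the reference,
# $length is the length of the reference.
sub fill_gaps {
    my ($seq, $start, $length) = @_;
    my $left  = $start - 1;
    my $right = $length - ($start + length($seq)) + 1;
    $right = 0 if $right < 0;   # fragment reaches or passes the end
    return ('-' x $left) . $seq . ('-' x $right);
}

# Toy data (hypothetical): a 10-base reference, a 4-base fragment at position 3.
my $reference = "TTACGTTTTT";
my $padded    = fill_gaps("ACGT", 3, length($reference));

print "$reference\n";   # TTACGTTTTT
print "$padded\n";      # --ACGT----
```

The padded string has the same length as the reference, which is what allows every HSP to be added to the same alignment as the contig.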

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new (-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        # record the HSPs of the first subject only
        while(my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                $query->add_SeqFeature($hsp);
            }
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
                     " Primary tag ", $feat->primary_tag,
                     ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;

Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new (-file   => $ARGV[0],
                                  -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        # one tab-separated line per HSP: P value, percent identity, score
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                             $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that appear for the first time in this round
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits seen in earlier rounds that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }

Page 80: Bio Perl

Appendix A Solutions

Solution A18 Extracting the subject sequence

Exercise 413

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A19 Extracting alignments

Exercise 414

use BioSearchIO

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])

my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]

while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())

if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n

Solution A20 Locate EST in a genome

Exercise 415

use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

73

Appendix A Solutions

my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)

my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)

my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])

my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())

my $i = 0while( my $hsp = $hit-gtnext_hsp())

$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(

-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)

$aln-gtadd_seq($query_seq)

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln

sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)

$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)

$gapped_seq = -return $gapped_seq

74

Appendix A Solutions

Example of use

perl solutionblast_locatepl datacontig datablast-estout

Solution A21 Record Blast hits as sequence features

Exercise 416

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair

my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV

foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db

program =gt rsquoblastprsquo)

my $report = $factory-gtblastall($query)

while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)

print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)

last

foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)

$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query

75

Appendix A Solutions

Solution A22 Print informations from a BPLite report

Exercise 417

use BioSeqIOuse BioToolsRunStandAloneBlast

my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)

my $query = $Seq_in-gtnext_seq()

my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)

my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n

Solution A23 Create a BioToolsBPlite from a file

Exercise 418

use BioToolsBPlite

my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

76

Appendix A Solutions

Solution A24 Create a BioToolsBPlite from standard input

Exercise 419

use BioToolsBPlite

my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())

print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())

print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n

Solution A25 Parse a PSI-blast report

Exercise 420

use BioToolsBPpsilite

my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])

for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )

print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)

if (grep Q$hitE $oldhitarray_ref ) next

print DISAPEARED hit name $hitn

push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )

77

Appendix A Solutions

Solution A26 Build a BioSimpleAlign object

Exercise 421

use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq

my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )

my $query1 = lt$query1_ingt

my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )

my $query2 = lt$query2_ingt

my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )

$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)

$report = $factory-gtbl2seq($query1 $query2)

while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength

)my $sbjctSeq = BioLocatableSeq-gtnew(

-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)

$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln

A4 Databases

78

Appendix A Solutions

Solution A27

Exercise 51

bull (a) database construction

localbinperl -w

(a) Genscan build a database containing one entry per CDS

use strict

use BioToolsGenscanuse BioSeqIO

my $genscan_file = shift ARGV

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

date of constructionmy $date = gmtime

while(my $gene = $genscan-gtnext_prediction())

create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)

extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)

annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())

$entry-gtadd_SeqFeature($exon)

79

Appendix A Solutions

add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())

$entry-gtadd_SeqFeature($utr)

print the entry to the database flatfile in embl formatprint $banque $entry

You can use this script like this

perl solutionbuild_dbpl datagenscan-arabout gt mydbembl

bull (b) make index

usrbinperl -w

(b) Genscan create the index

use strictuse BioIndexEMBL

my $Index_File_Name = shift

my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)

$inx-gtmake_index(ARGV)

You can use this script like this

perl solutionmake_indexpl mydbidx mydbembl

80

Appendix A Solutions

bull (c) use index

usrbinperl -w

(c) Genscan retrieve some files

use strictuse BioIndexEMBLuse BioSeqIO

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)

my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)

foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen

You can use this script like this

perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3

since entries names follows the pattern GENSCAN_predicted_CDS_xxx

Solution A28 Parse a Genscan report and build a database entry with the genomicsequence

Exercise 52

localbinperl -w

Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon

use strict

use BioToolsGenscanuse BioSeqIO

81

Appendix A Solutions

my $genscan_file = shift ARGV

my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)

sequence submitted to genscanmy $seq = $seqin-gtnext_seq()

genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)

database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT

-format =gt rsquoemblrsquo)

+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime



Appendix A Solutions

    my $seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $seq = $seq_in->next_seq();
    my $contig_seq = Bio::LocatableSeq->new(
        -seq   => $seq->seq,
        -id    => $seq->display_id,
        -start => 1,
        -end   => $seq->length);

    my $aln = Bio::SimpleAlign->new();
    $aln->add_seq($contig_seq);

    my $blast_report = new Bio::SearchIO('-format' => 'blast',
                                         '-file'   => $ARGV[1]);

    my $result = $blast_report->next_result;
    while (my $hit = $result->next_hit()) {
        my $i = 0;
        while (my $hsp = $hit->next_hsp()) {
            $i++;
            my $start = $hsp->hit->start();
            my $end   = $hsp->hit->end();
            my $name  = $result->query_name . "_HSP_$i";
            my $seq   = fill_gaps($hsp->query_string, $start, $aln->length);
            my $query_seq = Bio::LocatableSeq->new(
                -seq   => $seq,
                -id    => $name,
                -start => $start,
                -end   => $end);
            $aln->add_seq($query_seq);
        }
    }

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');
    print $out $aln;

    sub fill_gaps {
        # add gaps from 0 to start and from start + seq length to end
        my ($seq, $start, $length) = @_;
        my $gapped_seq = "";
        for (my $i = 0; $i < $start - 1; $i++) {
            $gapped_seq .= "-";
        }
        $gapped_seq .= $seq;
        for (my $i = $start + length($seq); $i <= $length; $i++) {
            $gapped_seq .= "-";
        }
        return $gapped_seq;
    }

Example of use:

    perl solution/blast_locate.pl data/contig data/blast-est.out
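The padding done by fill_gaps can also be written with Perl's repetition operator. The following is only a sketch in core Perl, not part of the course solution; the name pad_with_gaps and the sample values are made up for illustration. It produces the same string as the two loops whenever the fragment fits inside the row.

```perl
use strict;
use warnings;

# Pad a fragment with '-' so that it starts at column $start (1-based)
# of a row that is $length columns wide.
sub pad_with_gaps {
    my ($frag, $start, $length) = @_;
    my $left  = $start - 1;                         # gaps before the fragment
    my $right = $length - $left - length($frag);    # gaps after the fragment
    $left  = 0 if $left  < 0;
    $right = 0 if $right < 0;
    return ('-' x $left) . $frag . ('-' x $right);
}

print pad_with_gaps('ACGT', 3, 10), "\n";   # prints --ACGT----
```

Using `x` avoids the off-by-one risk of the hand-written loop bounds, at the cost of an explicit clamp for fragments that would overrun the row.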

Solution A.21. Record Blast hits as sequence features

Exercise 4.16

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::Seq;
    use Bio::SeqFeature::SimilarityPair;

    # first argument: query sequence file; remaining arguments: databases
    my $seqfile = shift @ARGV;
    my $in = Bio::SeqIO->new(-file => $seqfile, -format => 'fasta');
    my $query = $in->next_seq();
    my @dbs = @ARGV;

    foreach my $db (@dbs) {
        my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                            program  => 'blastp');

        my $report = $factory->blastall($query);

        while (my $sbjct = $report->nextSbjct) {
            print STDERR "name: ", $sbjct->name, "\n";
            while (my $hsp = $sbjct->nextHSP) {
                print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
                # record the HSP as a feature of the query sequence
                $query->add_SeqFeature($hsp);
            }
            # only the first subject of each report is recorded
            last;
        }
    }

    foreach my $feat ($query->all_SeqFeatures()) {
        print STDERR "Feature from ", $feat->start, " to ", $feat->end,
            " Primary tag ", $feat->primary_tag,
            ", produced by ", $feat->source_tag(), "\n";
        $feat->primary_tag("BLAST");
    }

    my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'swiss');
    print $out $query;


Solution A.22. Print information from a BPLite report

Exercise 4.17

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new(-file   => $ARGV[0],
                                 -format => 'fasta');

    my $query = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                        'database' => 'swissprot');

    my $blast_report = $factory->blastall($query);
    while (my $subject = $blast_report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
        }
    }

Solution A.23. Create a Bio::Tools::BPlite from a file

Exercise 4.18

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-file => $ARGV[0]);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }


Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t", $hsp->P, $hsp->score, $hsp->length,
                       $hsp->match, $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that appear for the first time in this round
        my $hitarray_ref = $result->newhits;
        foreach my $hit (@$hitarray_ref) {
            print "NEW hit name: $hit\n";
        }

        # hits seen in earlier rounds that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref) {
                next;
            }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref);
        push (@previous_hitarray, @$hitarray_ref);
    }
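The bookkeeping of new and disappeared hits can be tried without a PSI-BLAST report. The sketch below is core Perl only and is not the course solution: it swaps the grep-with-regex test for hash lookups, and the hit names, the @rounds data and the compare_rounds helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical hit names for three rounds; in the solution these lists
# come from the parsed report.
my @rounds = (
    [qw(P1 P2 P3)],
    [qw(P2 P3 P4)],
    [qw(P3 P4)],
);

# Names only in the current round are new; names only in the previous
# round have disappeared.
sub compare_rounds {
    my ($previous, $current) = @_;
    my %in_prev = map { $_ => 1 } @$previous;
    my %in_curr = map { $_ => 1 } @$current;
    my @new  = grep { !$in_prev{$_} } @$current;
    my @gone = grep { !$in_curr{$_} } @$previous;
    return (\@new, \@gone);
}

my $previous = [];
for my $i (0 .. $#rounds) {
    my ($new, $gone) = compare_rounds($previous, $rounds[$i]);
    print "iteration ", $i + 1, ": new=@$new disappeared=@$gone\n";
    $previous = $rounds[$i];
}
```

Exact hash lookups avoid the substring matches that a regex test can produce when one hit name is contained in another.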


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh(-file   => $ARGV[0],
                                      -format => 'fasta');

    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh(-file   => $ARGV[1],
                                      -format => 'fasta');

    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw');

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while (my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #! /local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # date of construction
    my $date = gmtime;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;    # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #! /usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new(-filename   => $Index_File_Name,
                                    -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl

• (c) use index

    #! /usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id: $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
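The identifier handling in part (a) relies on `$1` after a pattern match, which is only meaningful when the match succeeded. A minimal core-Perl sketch of a guarded version follows; the middle_field helper and the sample identifier (a 'name|GENSCAN_predicted_CDS_n|length' layout) are assumptions made for illustration, not taken from a real Genscan report.

```perl
use strict;
use warnings;

# Hypothetical identifier with the entry name between two '|' characters.
my $raw_id = 'seq1|GENSCAN_predicted_CDS_3|1254_bp';

# Return the text between the first and the last '|', or undef when the
# identifier does not contain two '|' characters.
sub middle_field {
    my ($id) = @_;
    return $id =~ /\|(.+)\|/ ? $1 : undef;
}

my $id = middle_field($raw_id);
print defined $id ? "$id\n" : "no id found\n";   # prints GENSCAN_predicted_CDS_3
```

Returning undef on a failed match lets the caller decide whether to skip the entry or keep the original identifier.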

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #! /local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #          + 1 feature for the whole CDS + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new(-fh     => \*STDIN,
                                -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh(-fh     => \*STDOUT,
                                   -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new(-primary  => 'CDS',
                                                       -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new(-seq     => $seqstr,
                                          -id      => $ids[1],
                                          -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
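The coordinate arithmetic in this solution (a window of delta bases around the gene, with exon positions made relative to that window) can be checked on plain numbers. The sketch below is a variation in core Perl, not the solution's exact arithmetic: it assumes 1-based coordinates and clamps the window to 1..sequence length, and the helper names and exon coordinates are hypothetical.

```perl
use strict;
use warnings;

# Window extending $delta bases beyond the outermost exon boundaries,
# clamped to 1 .. $seq_length. $exons is a list of [start, end] pairs.
sub padded_window {
    my ($exons, $delta, $seq_length) = @_;
    my ($min, $max) = ($exons->[0][0], $exons->[0][1]);
    for my $e (@$exons) {
        $min = $e->[0] if $e->[0] < $min;
        $max = $e->[1] if $e->[1] > $max;
    }
    my $start = $min - $delta;
    my $end   = $max + $delta;
    $start = 1           if $start < 1;
    $end   = $seq_length if $end > $seq_length;
    return ($start, $end);
}

# Express exon coordinates relative to the first base of the window.
sub shift_exons {
    my ($exons, $window_start) = @_;
    my $offset = $window_start - 1;
    return [ map { [ $_->[0] - $offset, $_->[1] - $offset ] } @$exons ];
}

my @exons = ([150, 200], [400, 480]);                 # hypothetical exons
my ($start, $end) = padded_window(\@exons, 100, 500);
my $shifted = shift_exons(\@exons, $start);
print "window $start..$end\n";
print join(' ', map { "$_->[0]..$_->[1]" } @$shifted), "\n";
```

Computing the window from all exons first, and shifting afterwards, keeps the window end in the original coordinate system, which is what a subsequence extraction on the submitted sequence needs.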


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        my $in = Bio::SeqIO->new(-file   => "golden $access $entry 2> /dev/null |",
                                 -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }
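Two ideas in this module can be exercised without bioperl or golden: validating a 'db:id' entry against a lookup table, and reading the exit status of a command opened through a pipe. The sketch below is core Perl under stated assumptions: %FORMAT_OF, parse_entry, run_piped and the entry 'embl:X12345' are hypothetical, and a child perl process stands in for golden. The module's call to DESTROY presumably serves to close the pipe so that `$?` is set; the sketch uses an explicit close for that.

```perl
use strict;
use warnings;

# Format lookup table, same idea as %AVAIL_LOCALDB in the module.
my %FORMAT_OF = (sprot => 'swiss', gb => 'genbank', embl => 'embl');

# Split 'db:id' and validate both parts; dies with a message on bad input.
sub parse_entry {
    my ($entry) = @_;
    my ($db, $id) = split /:/, $entry, 2;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $FORMAT_OF{$db};
    return ($db, $id, $FORMAT_OF{$db});
}

# Run a command through a pipe and return (output, exit code).
# close() waits for the child, so $? is only meaningful afterwards.
sub run_piped {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!\n";
    my $output = do { local $/; <$fh> };
    close($fh);
    return ($output, $? >> 8);
}

my ($db, $id, $format) = parse_entry('embl:X12345');   # hypothetical entry
print "$db $id $format\n";

# $^X is the running perl; the child prints a line and exits with status 3.
my ($out, $status) = run_piped($^X, '-e', 'print "hello\n"; exit 3');
print "output=$out", "status=$status\n";
```

The list form of open avoids passing the entry name through a shell, which matters when identifiers come from user input.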




my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)

$entry-gtprimary_seq($newseq)

print the entry to the database flatfile in embl formatprint $banque $entry

83

Appendix A Solutions

Solution A29 Build a small bioperl module (for the golden program)

Exercise 53

Build a small bioperl module

package LocalDBAccess

use strictuse vars qw(ISA AVAIL_LOCALDB)

use BioDBRandomAccessIuse BioSeqIOuse BioSeq

BEGIN

AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )

ISA = qw(BioDBRandomAccessI)

sub get_Seq_by_idmy ($self args) = _

return $self-gtget_Seq( args rsquo-irsquo)

sub get_Seq_by_accmy ($self args) = _

return $self-gtget_Seq(args rsquo-arsquo)

sub get_Seq

my ($self $entry $access) = _

84

Appendix A Solutions

my ($db $id) = split(rsquorsquo $entry)

if ($access ~ ^-[ia]$) $access =

if ($id) $self-gtthrow (no database specified)

if (_has_local_DB($db))

$self-gtthrow ($db -- no such database available)

my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))

my $seq = $in-gtnext_seq()

$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)

$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)

return $seq

sub _has_local_DB

my ($db) = _

if (exists $AVAIL_LOCALDB$db) return 1 return 0

sub _get_seq_format

my ($db) = _

return $AVAIL_LOCALDB$db

85

Appendix A Solutions

86

Appendix A Solutions

Solution A.24. Create a Bio::Tools::BPlite from standard input

Exercise 4.19

    use Bio::Tools::BPlite;

    my $report = new Bio::Tools::BPlite(-fh => \*STDIN);
    while (my $subject = $report->nextSbjct()) {
        print $subject->name(), "\n";
        while (my $hsp = $subject->nextHSP()) {
            print join("\t",
                       $hsp->P,
                       $hsp->score,
                       $hsp->length,
                       $hsp->match,
                       $hsp->positive), "\n";
        }
    }

Solution A.25. Parse a PSI-blast report

Exercise 4.20

    use Bio::Tools::BPpsilite;

    my $blast_report = new Bio::Tools::BPpsilite ('-file' => $ARGV[0]);

    for my $iteration (1 .. $blast_report->number_of_iterations()) {
        print "\n-----------------------------\niteration: $iteration\n";
        my $result = $blast_report->round($iteration);

        # hits that appear for the first time in this round
        my $hitarray_ref = $result->newhits;
        foreach my $hit ( @$hitarray_ref ) {
            print "NEW hit name: $hit\n";
        }

        # hits seen in earlier rounds that are no longer reported
        my $oldhitarray_ref = $result->oldhits;
        foreach my $hit (@previous_hitarray) {
            if (grep /\Q$hit\E/, @$oldhitarray_ref ) { next; }
            print "DISAPPEARED hit name: $hit\n";
        }

        push (@previous_hitarray, @$oldhitarray_ref );
        push (@previous_hitarray, @$hitarray_ref );
    }
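The bookkeeping in this solution (print the new hits of each round, then the previously seen hits that are missing from the old hits) can be tried without a PSI-BLAST report. The sketch below is only an illustration and not part of the course material: the rounds are hard-coded with hypothetical hit names, `disappeared_hits` is an invented helper, and it compares names exactly through a hash lookup instead of the `grep` pattern match, reporting each missing name once per round.

```perl
use strict;
use warnings;

# Return the names in @$previous that are absent from @$old_hits
# (exact string comparison, each name reported once).
sub disappeared_hits {
    my ($previous, $old_hits) = @_;
    my %still_there = map { $_ => 1 } @$old_hits;
    my %seen;
    return grep { !$still_there{$_} && !$seen{$_}++ } @$previous;
}

# Hypothetical rounds: hits already known ('old') and hits found now ('new').
my @rounds = (
    { old => [],               new => ['hitA', 'hitB'] },
    { old => ['hitA'],         new => ['hitC'] },
    { old => ['hitA', 'hitC'], new => [] },
);

my @previous;
my $iteration = 0;
foreach my $round (@rounds) {
    $iteration++;
    print "iteration: $iteration\n";
    print "NEW hit name: $_\n" foreach @{ $round->{new} };
    print "DISAPPEARED hit name: $_\n"
        foreach disappeared_hits(\@previous, $round->{old});
    push @previous, @{ $round->{old} }, @{ $round->{new} };
}
```

With these made-up rounds, `hitB` is reported as disappeared in rounds 2 and 3, because it was seen in round 1 and never shows up among the old hits afterwards.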


Solution A.26. Build a Bio::SimpleAlign object

Exercise 4.21

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;
    use Bio::AlignIO;
    use Bio::SimpleAlign;
    use Bio::LocatableSeq;

    my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                        -format => 'fasta' );
    my $query1 = <$query1_in>;

    my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                        -format => 'fasta' );
    my $query2 = <$query2_in>;

    my $out = Bio::AlignIO->newFh(-format => 'clustalw' );

    $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');

    $report = $factory->bl2seq($query1, $query2);

    while(my $hsp = $report->next_feature) {
        my $aln = Bio::SimpleAlign->new();
        my $querySeq = Bio::LocatableSeq->new(
            -seq   => $hsp->qs,
            -id    => $query1->display_id,
            -start => 1,
            -end   => $query1->length
        );
        my $sbjctSeq = Bio::LocatableSeq->new(
            -seq   => $hsp->ss,
            -id    => $report->sbjctName,
            -start => 1,
            -end   => $query2->length);

        $aln->add_seq($querySeq);
        $aln->add_seq($sbjctSeq);
        print $out $aln;
    }

A.4. Databases

Solution A.27.

Exercise 5.1

• (a) database construction

    #!/local/bin/perl -w

    # (a) Genscan: build a database containing one entry per CDS

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;

    my $genscan_file = shift @ARGV;

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # date of construction
    my $date = gmtime;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('mRNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        $cds->id() =~ /\|(.+)\|/;
        my $id = $1;
        # bug
        $cds->id($id);
        # add cds as sequence to the entry
        $entry->primary_seq($cds);

        # annotation: add all detected exons as SeqFeature to the entry
        foreach my $exon ($gene->exons()) {
            $entry->add_SeqFeature($exon);
        }

        # add all detected UTR's as SeqFeature to the entry
        foreach my $utr ($gene->utrs()) {
            $entry->add_SeqFeature($utr);
        }

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }

You can use this script like this:

    perl solution/build_db.pl data/genscan-arab.out > mydb.embl

• (b) make index

    #!/usr/bin/perl -w

    # (b) Genscan: create the index

    use strict;
    use Bio::Index::EMBL;

    my $Index_File_Name = shift;

    my $inx = Bio::Index::EMBL->new( -filename   => $Index_File_Name,
                                     -write_flag => 1);

    $inx->make_index(@ARGV);

You can use this script like this:

    perl solution/make_index.pl mydb.idx mydb.embl


• (c) use index

    #!/usr/bin/perl -w

    # (c) Genscan: retrieve some files

    use strict;
    use Bio::Index::EMBL;
    use Bio::SeqIO;

    my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                  -format => 'fasta');

    my $Index_File_Name = shift;
    my $inx = Bio::Index::EMBL->new($Index_File_Name);

    foreach my $id (@ARGV) {
        my $seq = $inx->fetch($id);
        if (defined $seq) {
            print $out $seq;
        } else {
            print STDERR "no entry with id $id in this banque\n";
        }
    }

You can use this script like this:

    perl solution/use_index.pl mydb.idx GENSCAN_predicted_CDS_3

since entry names follow the pattern GENSCAN_predicted_CDS_xxx.
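These entry names come from the middle field of the Genscan CDS identifier, which the database-construction script extracts with a pattern match on `|`. As a small sketch, assuming the identifier has three `|`-separated fields (the sample string is made up for illustration, and `entry_name` is a hypothetical helper), the extraction can be checked in isolation:

```perl
use strict;
use warnings;

# Return the text between the first and last '|' of an identifier,
# or undef when the identifier contains no such field.
sub entry_name {
    my ($raw_id) = @_;
    return $raw_id =~ /\|(.+)\|/ ? $1 : undef;
}

# Made-up identifier in the assumed "name|entry|length" layout.
my $raw_id = 'seq1|GENSCAN_predicted_CDS_3|1203_bp';
my $name   = entry_name($raw_id);
print defined $name ? "$name\n" : "no entry name found\n";
```

Testing the match before reading `$1` matters because a failed match leaves `$1` holding the capture of an earlier successful match.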

Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence

Exercise 5.2

    #!/local/bin/perl -w

    # Genscan: build a database - 1 entry per gene with genomic DNA
    #   + 1 feature for the whole CDS
    #   + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::Location::Split;
    use Bio::SeqFeature::Generic;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new ( -fh     => \*STDIN,
                                  -format => 'fasta');

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                     -format => 'embl');

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while(my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start = 0;
        my $end = 0;
        my $loc;
        my $first = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            # the subsequence starts $delta bases before the first exon
            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # make the exon coordinates relative to the subsequence
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }

            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }

            $newloc->add_sub_Location($loc);

            $end = $loc->end + $delta;
            if ($end > $seq->length()) { $end = $seq->length(); }
        }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join of all exons)
            my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                         -location => $newloc);

            # add the translation to the CDS Feature
            $newcds->add_tag_value ('translation',
                                    $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ( $start > $end ) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                            -id      => $ids[1],
                                            -moltype => 'dna');

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
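The coordinate handling in this solution reduces every exon start and end by one offset: the start of the first exon minus the delta, never below 0. The sketch below repeats that arithmetic on plain numbers so it can be followed without a Genscan report. It is an illustration under assumptions: the exon coordinates are invented, and `rebase_exons` is a hypothetical helper, not a bioperl function.

```perl
use strict;
use warnings;

# Shift exon coordinates by the offset used in the solution:
# first exon start minus $delta, never below 0.
# $exons is a reference to a list of [start, end] pairs.
sub rebase_exons {
    my ($exons, $delta) = @_;
    return (0, []) unless @$exons;
    my $offset = $exons->[0][0] - $delta;
    $offset = 0 if $offset < 0;
    my @shifted = map { [ $_->[0] - $offset, $_->[1] - $offset ] } @$exons;
    return ($offset, \@shifted);
}

# Two invented exons and a delta of 100, as in the script.
my ($offset, $shifted) = rebase_exons([ [350, 420], [600, 710] ], 100);
print "offset: $offset\n";
print join('..', @$_), "\n" foreach @$shifted;
```

When the first exon lies closer than the delta to the beginning of the sequence, the offset is 0 and the coordinates are left unchanged.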


Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their sequence format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;
        return $self->get_Seq( @args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;
        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {
        my ($self, $entry, $access) = @_;

        # entries are given as "db:id"
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (!$id) { $self->throw ("no database specified"); }

        if (!_has_local_DB($db)) {
            $self->throw ("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new(
            -file   => "golden $access $entry 2> /dev/null |",
            -format => _get_seq_format($db));

        my $seq = $in->next_seq();

        $in->DESTROY();
        if ($? >> 8) {
            $self->throw ("$id -- no such entry in database $db");
        }

        $self->throw ("$id -- no such entry in database $db")
            if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {
        my ($db) = @_;
        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {
        my ($db) = @_;
        return $AVAIL_LOCALDB{$db};
    }

    1;
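The argument checks at the top of `get_Seq` can be tried on their own, without bioperl or the golden program installed. The sketch below is an illustration under assumptions: `parse_entry` is a hypothetical helper that is not part of the module, the entry `sprot:P12345` is a made-up example, the table of known databases is passed as a parameter, and errors are reported with `die` rather than `$self->throw`.

```perl
use strict;
use warnings;

# Known databases and their sequence formats, mirroring %AVAIL_LOCALDB.
my %formats = ( sprot => 'swiss',   sptrnrdb => 'swiss',
                gb    => 'genbank', embl     => 'embl' );

# Split "db:id", validate both parts and normalise the access flag.
# Returns (db, id, access, format) or dies with a message.
sub parse_entry {
    my ($entry, $access, $known) = @_;
    my ($db, $id) = split(':', $entry, 2);
    $access = '' unless defined $access && $access =~ /^-[ia]$/;
    die "no database specified\n" unless defined $id && length $id;
    die "$db -- no such database available\n" unless exists $known->{$db};
    return ($db, $id, $access, $known->{$db});
}

my @parsed = parse_entry('sprot:P12345', '-a', \%formats);
print join(' ', @parsed), "\n";

# An entry without a database prefix is rejected before any
# external command would be started.
eval { parse_entry('P12345', '-i', \%formats) };
print "error: $@" if $@;
```

Validating the entry before building the `golden ... |` command line keeps malformed input from ever reaching the shell pipe.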

  • Bioperl course
    • Chapter 1 General introduction
      • 11 Bioperl documentation
      • 12 General bioperl classes
        • Chapter 2 Sequences
          • 21 The BioSeqIO class
          • 22 Sequence classes
          • 23 Features and Location classes
          • 24 Sequence analysis tools
            • Chapter 3 Alignments
              • 31 AlignIO
              • 32 SimpleAlign
              • 33 Code reading protal2dna
                • Chapter 4 Analysis
                  • 41 Blast
                  • 42 Genscan
                    • Chapter 5 Databases
                      • 51 Database classes
                      • 52 Accessing a local database with golden
                        • Chapter 6 Perl Reminders
                          • 61 UML
                          • 62 Perl reminders to use bioperl modules
                          • 63 Perl reminders for a further advanced understanding of bioperl modules
                            • Appendix A Solutions
                              • A1 Sequences
                              • A2 Alignments
                              • A3 Analysis
                              • A4 Databases
Page 87: Bio Perl

Appendix A Solutions

add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())

$entry-gtadd_SeqFeature($utr)

print the entry to the database flatfile in embl formatprint $banque $entry

You can use this script like this

perl solutionbuild_dbpl datagenscan-arabout gt mydbembl

bull (b) make index

usrbinperl -w

(b) Genscan create the index

use strictuse BioIndexEMBL

my $Index_File_Name = shift

my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)

$inx-gtmake_index(ARGV)

You can use this script like this

perl solutionmake_indexpl mydbidx mydbembl

80

Appendix A Solutions

bull (c) use index

usrbinperl -w

(c) Genscan retrieve some files

use strictuse BioIndexEMBLuse BioSeqIO

my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)

my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)

foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen

You can use this script like this

perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3

since entries names follows the pattern GENSCAN_predicted_CDS_xxx

Solution A28 Parse a Genscan report and build a database entry with the genomicsequence

Exercise 52

    #!/local/bin/perl -w

    # Genscan: build a database
    #  - 1 entry per gene with genomic DNA
    #  + 1 feature for the whole CDS
    #  + 1 feature per exon

    use strict;

    use Bio::Tools::Genscan;
    use Bio::SeqIO;
    # classes instantiated directly below
    use Bio::Seq::RichSeq;
    use Bio::PrimarySeq;
    use Bio::SeqFeature::Generic;
    use Bio::Location::Split;

    my $genscan_file = shift @ARGV;

    my $seqin = Bio::SeqIO->new( -fh     => \*STDIN,
                                 -format => 'fasta' );

    # sequence submitted to genscan
    my $seq = $seqin->next_seq();

    # genscan report
    my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);

    # database to construct
    my $banque = Bio::SeqIO->newFh( -fh     => \*STDOUT,
                                    -format => 'embl' );

    # +/- delta of the subsequence for each entry in database
    my $delta = 100;
    # date of construction
    my $date = gmtime;

    print $banque $seq;

    while (my $gene = $genscan->next_prediction()) {

        # create entry object as database sequence
        my $entry = Bio::Seq::RichSeq->new();
        $entry->add_date($date);
        $entry->molecule('DNA');
        $entry->division('PLN');

        # extract id of genscan prediction
        my $cds = $gene->predicted_cds;
        my @ids = split(/\|/, $cds->id());

        # annotation
        my $start  = 0;
        my $end    = 0;
        my $loc;
        my $first  = 1;
        my $newloc = new Bio::Location::Split();
        my $strand = 1;
        my $nocds  = 0;
        foreach my $exon ($gene->exons()) {

            $loc = $exon->location();

            if ($first) {
                $first = 0;
                $start = $loc->start - $delta;
                if ($start < 0) { $start = 0; }
            }

            # shift the exon into the coordinates of the new entry
            $loc->start($loc->start - $start);
            $loc->end  ($loc->end   - $start);

            $exon->add_tag_value('source', 'genscan');
            $exon->primary_tag =~ m/(.*)Exon/;
            $exon->primary_tag('Exon');
            $exon->add_tag_value(lc($1), 1);

            # add all detected exons as SeqFeature to the entry
            $entry->add_SeqFeature($exon);

            # construct location for CDS
            if ($strand == -1 && $loc->strand == 1) {
                $newloc->warn("exons are coded on different strands");
                $nocds = 1;
            }
            if ($loc->strand == -1) {
                $loc->strand(1);
                $newloc->strand(-1);
            }
            $newloc->add_sub_Location($loc);
        }

        $end = $loc->end + $delta;
        if ($end > $seq->length()) { $end = $seq->length(); }

        if (! $nocds) {
            # construct a SeqFeature CDS for the entry (join all exons)
            my $newcds = Bio::SeqFeature::Generic->new( -primary  => 'CDS',
                                                        -location => $newloc );

            # add the translation to the CDS Feature
            $newcds->add_tag_value('translation',
                                   $gene->predicted_protein()->seq());

            # add the CDS as Feature to the entry
            $entry->add_SeqFeature($newcds);
        }

        # construct the sequence for the entry: subsequence of the
        # sequence submitted to genscan, +/- delta at each side
        my $seqstr;
        if ($start > $end) {
            $seqstr = $seq->subseq($end, $start);
        } else {
            $seqstr = $seq->subseq($start, $end);
        }

        my $newseq = Bio::PrimarySeq->new( -seq     => $seqstr,
                                           -id      => $ids[1],
                                           -moltype => 'dna' );

        $entry->primary_seq($newseq);

        # print the entry to the database flatfile in embl format
        print $banque $entry;
    }
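The windowing idea used in this solution (a subsequence of the submitted sequence that extends delta bases on each side of the gene, with exon coordinates re-expressed relative to that window) can be sketched in plain Perl, without bioperl objects. This is only a minimal sketch, not the script's exact arithmetic: it assumes 1-based inclusive coordinates and non-overlapping exons on the forward strand, given as [start, end] pairs. The helper name window_for_gene is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): compute the window around a
# gene and shift its exons into window coordinates.
# Assumes 1-based inclusive coordinates and forward-strand exons.
sub window_for_gene {
    my ($exons, $delta, $seq_length) = @_;

    # Sort a copy so that the input order does not matter.
    my @sorted = sort { $a->[0] <=> $b->[0] } @$exons;

    # Window start: delta bases before the first exon, clamped to 1.
    my $start = $sorted[0][0] - $delta;
    $start = 1 if $start < 1;

    # Window end: delta bases after the last exon, clamped to the
    # length of the submitted sequence.
    my $end = $sorted[-1][1] + $delta;
    $end = $seq_length if $end > $seq_length;

    # Re-express each exon relative to the window (still 1-based).
    my @shifted = map { [ $_->[0] - $start + 1, $_->[1] - $start + 1 ] } @sorted;

    return ($start, $end, \@shifted);
}

my ($start, $end, $shifted) =
    window_for_gene([ [ 250, 300 ], [ 400, 480 ] ], 100, 500);

print "window: $start..$end\n";                    # window: 150..500
print "exon: $_->[0]..$_->[1]\n" for @$shifted;    # exon: 101..151, exon: 251..331
```

With delta = 100 and a 500-base sequence, the window end is clamped from 580 to 500, and the first exon, which starts at base 250 of the submitted sequence, starts at base 101 of the new entry.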

Solution A.29. Build a small bioperl module (for the golden program)

Exercise 5.3

    # Build a small bioperl module

    package LocalDBAccess;

    use strict;
    use vars qw(@ISA %AVAIL_LOCALDB);

    use Bio::DB::RandomAccessI;
    use Bio::SeqIO;
    use Bio::Seq;

    BEGIN {
        # local databases known to golden, with their flatfile format
        %AVAIL_LOCALDB = ( sprot    => 'swiss',
                           sptrnrdb => 'swiss',
                           gb       => 'genbank',
                           embl     => 'embl' );
    }

    @ISA = qw(Bio::DB::RandomAccessI);

    sub get_Seq_by_id {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-i');
    }

    sub get_Seq_by_acc {
        my ($self, @args) = @_;

        return $self->get_Seq(@args, '-a');
    }

    sub get_Seq {

        my ($self, $entry, $access) = @_;

        # entries are given as database:identifier
        my ($db, $id) = split(':', $entry);

        if ($access !~ /^-[ia]$/) { $access = ""; }

        if (! $id) { $self->throw("no database specified"); }

        if (! _has_local_DB($db)) {
            $self->throw("$db -- no such database available");
        }

        # read the entry from the output of the golden program
        my $in = Bio::SeqIO->new( -file   => "golden $access $entry 2> /dev/null |",
                                  -format => _get_seq_format($db) );

        my $seq = $in->next_seq();

        # closing the stream sets $? to the exit status of golden
        $in->DESTROY();
        if ($? >> 8) { $self->throw("$id -- no such entry in database $db"); }

        $self->throw("$id -- no such entry in database $db") if (! defined $seq);

        return $seq;
    }

    sub _has_local_DB {

        my ($db) = @_;

        if (exists $AVAIL_LOCALDB{$db}) { return 1; }
        return 0;
    }

    sub _get_seq_format {

        my ($db) = @_;

        return $AVAIL_LOCALDB{$db};
    }

    1;
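The get_Seq method relies on two core Perl mechanisms: a command opened as a read pipe, and the exit status of that command, which Perl stores in $? once the pipe is closed (the exit code itself is $? >> 8). The sketch below isolates that mechanism, independently of bioperl and golden. It is an illustration under assumptions: the child process is the running Perl interpreter ($^X), which stands in for golden, and the helper name exit_code_of is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command through a read pipe, discard its
# output, and return its exit code.
sub exit_code_of {
    my (@command) = @_;

    # List form of open: no shell is involved, arguments are passed as-is.
    open(my $pipe, '-|', @command) or die "cannot run @command: $!";
    my @output = <$pipe>;

    # Closing the pipe waits for the child process and sets $?.
    close($pipe);

    # The high byte of $? holds the exit code of the child.
    return $? >> 8;
}

# $^X is the path of the running perl; it stands in for golden here.
my $ok     = exit_code_of($^X, '-e', 'print "found\n"; exit 0');
my $failed = exit_code_of($^X, '-e', 'exit 3');

print "ok=$ok failed=$failed\n";    # ok=0 failed=3
```

A non-zero exit code is what get_Seq turns into a "no such entry" exception; the additional test on a defined $seq covers the case where the command succeeds but produces nothing that can be parsed.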
