Bio Perl
-
Upload
alejandro-morales-gomez -
Category
Documents
-
view
15 -
download
0
description
Transcript of Bio Perl
![Page 1: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/1.jpg)
Bioperl course
by Catherine Letondal and Katja Schuerer
Bioperl course [httpwwwpasteurfrrechercheunitessisformationbioperl]by Catherine Letondal and Katja Schuerer
Introduction to Bioperl (bioperlorg [httpbioperlorg]) This course introduces to the bioperl modules withexamples and exercises The course content has been upgraded to bioperl 10 (differences with bioperl version 07are displayed in yellow color in the diagrams) Apart from this upgrade the presentation has been reorganized bytopics rather than by daily schedule without any major changes in the content A PDF [supportpdf] version isnow available
Table of Contents
1 General introduction 111 Bioperl documentation 112 General bioperl classes 1
2 Sequences 321 The BioSeqIO class 3
211 Format Converter 322 Sequence classes 4
221 Introduction 4222 Building mechanisms summary 4223 A deeper insight into the BioSeq class 5224 Bioperl rsquoSequencersquo classes structure 10
23 Features and Location classes 11231 Feature 11232 Code reading extracting CDS 13233 Tag system 14234 Location 15235 Graphical view of features 17
24 Sequence analysis tools 183 Alignments 21
31 AlignIO 2132 SimpleAlign 2233 Code reading protal2dna 26
4 Analysis 2741 Blast 27
411 Running Blast 27412 Parsing Blast 28413 BioToolsBPlite family parsers 33414 PSI-BLAST (Position Specific Iterative Blast) 37415 bl2seq Blast 2 sequences 38416 Blast Internal classes structure 39
42 Genscan 415 Databases 45
51 Database classes 4552 Accessing a local database with golden 48
6 Perl Reminders 5161 UML 5162 Perl reminders to use bioperl modules 51
621 References 51622 Filehandles and streams 52623 Exceptions 53624 GetoptStd 54625 Classes 55626 BEGIN block 56
63 Perl reminders for a further advanced understanding of bioperl modules 56631 Modules 56632 Compiler instructions 56633 Tie 57
A Solutions 59A1 Sequences 59A2 Alignments 65A3 Analysis 66A4 Databases 78
List of Figures
21 BioSeq class structure 622 Relation of a SwissProt entry and the corresponding BioSeq object components 623 BioAnnotation package structure 924 Bioperl rsquoSequencersquo classes structure 1025 Features Classes structure 1226 Correspondance between an EMBL entry and bioperl tags 1427 Location Classes Structure 1628 Graphical view of some features of the SwissProt entry BACR_HALHA 1831 AlignIO Classes diagram 2132 Align Classes diagram 2341 Blast Classes diagram 3042 BPLite Classes diagram 3443 Blast internal classes diagram 3944 Genscan Classes Structure 4251 Database Classes structure 4561 UML meanings 51A1 BioSeqIO structure 59
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 2: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/2.jpg)
Bioperl course [httpwwwpasteurfrrechercheunitessisformationbioperl]by Catherine Letondal and Katja Schuerer
Introduction to Bioperl (bioperlorg [httpbioperlorg]) This course introduces to the bioperl modules withexamples and exercises The course content has been upgraded to bioperl 10 (differences with bioperl version 07are displayed in yellow color in the diagrams) Apart from this upgrade the presentation has been reorganized bytopics rather than by daily schedule without any major changes in the content A PDF [supportpdf] version isnow available
Table of Contents
1 General introduction 111 Bioperl documentation 112 General bioperl classes 1
2 Sequences 321 The BioSeqIO class 3
211 Format Converter 322 Sequence classes 4
221 Introduction 4222 Building mechanisms summary 4223 A deeper insight into the BioSeq class 5224 Bioperl rsquoSequencersquo classes structure 10
23 Features and Location classes 11231 Feature 11232 Code reading extracting CDS 13233 Tag system 14234 Location 15235 Graphical view of features 17
24 Sequence analysis tools 183 Alignments 21
31 AlignIO 2132 SimpleAlign 2233 Code reading protal2dna 26
4 Analysis 2741 Blast 27
411 Running Blast 27412 Parsing Blast 28413 BioToolsBPlite family parsers 33414 PSI-BLAST (Position Specific Iterative Blast) 37415 bl2seq Blast 2 sequences 38416 Blast Internal classes structure 39
42 Genscan 415 Databases 45
51 Database classes 4552 Accessing a local database with golden 48
6 Perl Reminders 5161 UML 5162 Perl reminders to use bioperl modules 51
621 References 51622 Filehandles and streams 52623 Exceptions 53624 GetoptStd 54625 Classes 55626 BEGIN block 56
63 Perl reminders for a further advanced understanding of bioperl modules 56631 Modules 56632 Compiler instructions 56633 Tie 57
A Solutions 59A1 Sequences 59A2 Alignments 65A3 Analysis 66A4 Databases 78
List of Figures
21 BioSeq class structure 622 Relation of a SwissProt entry and the corresponding BioSeq object components 623 BioAnnotation package structure 924 Bioperl rsquoSequencersquo classes structure 1025 Features Classes structure 1226 Correspondance between an EMBL entry and bioperl tags 1427 Location Classes Structure 1628 Graphical view of some features of the SwissProt entry BACR_HALHA 1831 AlignIO Classes diagram 2132 Align Classes diagram 2341 Blast Classes diagram 3042 BPLite Classes diagram 3443 Blast internal classes diagram 3944 Genscan Classes Structure 4251 Database Classes structure 4561 UML meanings 51A1 BioSeqIO structure 59
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 3: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/3.jpg)
Table of Contents
1 General introduction 111 Bioperl documentation 112 General bioperl classes 1
2 Sequences 321 The BioSeqIO class 3
211 Format Converter 322 Sequence classes 4
221 Introduction 4222 Building mechanisms summary 4223 A deeper insight into the BioSeq class 5224 Bioperl rsquoSequencersquo classes structure 10
23 Features and Location classes 11231 Feature 11232 Code reading extracting CDS 13233 Tag system 14234 Location 15235 Graphical view of features 17
24 Sequence analysis tools 183 Alignments 21
31 AlignIO 2132 SimpleAlign 2233 Code reading protal2dna 26
4 Analysis 2741 Blast 27
411 Running Blast 27412 Parsing Blast 28413 BioToolsBPlite family parsers 33414 PSI-BLAST (Position Specific Iterative Blast) 37415 bl2seq Blast 2 sequences 38416 Blast Internal classes structure 39
42 Genscan 415 Databases 45
51 Database classes 4552 Accessing a local database with golden 48
6 Perl Reminders 5161 UML 5162 Perl reminders to use bioperl modules 51
621 References 51622 Filehandles and streams 52623 Exceptions 53624 GetoptStd 54625 Classes 55626 BEGIN block 56
63 Perl reminders for a further advanced understanding of bioperl modules 56631 Modules 56632 Compiler instructions 56633 Tie 57
A Solutions 59A1 Sequences 59A2 Alignments 65A3 Analysis 66A4 Databases 78
List of Figures
21 BioSeq class structure 622 Relation of a SwissProt entry and the corresponding BioSeq object components 623 BioAnnotation package structure 924 Bioperl rsquoSequencersquo classes structure 1025 Features Classes structure 1226 Correspondance between an EMBL entry and bioperl tags 1427 Location Classes Structure 1628 Graphical view of some features of the SwissProt entry BACR_HALHA 1831 AlignIO Classes diagram 2132 Align Classes diagram 2341 Blast Classes diagram 3042 BPLite Classes diagram 3443 Blast internal classes diagram 3944 Genscan Classes Structure 4251 Database Classes structure 4561 UML meanings 51A1 BioSeqIO structure 59
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 4: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/4.jpg)
63 Perl reminders for a further advanced understanding of bioperl modules 56631 Modules 56632 Compiler instructions 56633 Tie 57
A Solutions 59A1 Sequences 59A2 Alignments 65A3 Analysis 66A4 Databases 78
List of Figures
21 BioSeq class structure 622 Relation of a SwissProt entry and the corresponding BioSeq object components 623 BioAnnotation package structure 924 Bioperl rsquoSequencersquo classes structure 1025 Features Classes structure 1226 Correspondance between an EMBL entry and bioperl tags 1427 Location Classes Structure 1628 Graphical view of some features of the SwissProt entry BACR_HALHA 1831 AlignIO Classes diagram 2132 Align Classes diagram 2341 Blast Classes diagram 3042 BPLite Classes diagram 3443 Blast internal classes diagram 3944 Genscan Classes Structure 4251 Database Classes structure 4561 UML meanings 51A1 BioSeqIO structure 59
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 5: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/5.jpg)
List of Figures
21 BioSeq class structure 622 Relation of a SwissProt entry and the corresponding BioSeq object components 623 BioAnnotation package structure 924 Bioperl rsquoSequencersquo classes structure 1025 Features Classes structure 1226 Correspondance between an EMBL entry and bioperl tags 1427 Location Classes Structure 1628 Graphical view of some features of the SwissProt entry BACR_HALHA 1831 AlignIO Classes diagram 2132 Align Classes diagram 2341 Blast Classes diagram 3042 BPLite Classes diagram 3443 Blast internal classes diagram 3944 Genscan Classes Structure 4251 Database Classes structure 4561 UML meanings 51A1 BioSeqIO structure 59
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 6: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/6.jpg)
List of Examples
21 SwissProt -gt Fasta 322 Loading a sequence from a remote server 423 Find the references to the PDB database entries 931 Format conversions with AlignIO 2132 Basic methods of SimpleAlign 2233 Filter gap columns 2441 StandAloneBlast run 2742 StandAloneBlast parsing 2843 Parsing from a Blast file 2944 Parsing with BPLite 3345 Running PSI-blast 3746 PSI-Blast SearchIO class 3747 Running bl2seq 3848 Genscan parsing 4149 Genscan parsing with sub-sequences 4151 Database class use 4552 Database Index creation 45
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 7: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/7.jpg)
List of Exercises
21 BioSeqIO 322 An universal converter 323 Display a sequence in fasta format 424 More on annotations 1025 Transmembran helices 1026 extractcds 1327 extractcds version 2 1431 Create an alignment without gaps 2641 Running Blast on a Swissprot entry 2742 Running Blast Setting parameters 2743 Running Blast Saving output 2744 Running a Remote Blast 2745 Display Blast hits 2846 Class of a Blast report 2947 Parse a Blast output file 3048 Parse Blast results on standard input 3049 Filtering hits by length 31410 Filtering hits by position 31411 Display the best hit by databank 32412 Multiple queries 32413 Extracting the subject sequence 32414 Extracting alignments 32415 Locate EST in a genome 32416 Record Blast hits as sequence features 32417 Print informations from a BPLite report 34418 Create a BioToolsBPlite from a file 34419 Create a BioToolsBPlite from standard input 34420 Parse a PSI-blast report 38421 Build a BioSimpleAlign object 39422 Code reading BioToolsGenscan module 4351 Parse a Genscan report and build a database entry 4652 Parse a Genscan report and build a database entry with the genomic sequence 4753 Build a small bioperl module (for the golden program) 48
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 8: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/8.jpg)
Chapter 1 General introduction
Chapter 1 General introduction
11 Bioperl documentation
1Bioperl 10 tutorial [httpbioperlorgCorebptutorialhtml]
2Wiki Bioperl documentation [httpbioperlorgwikihtmlBioPerlFrontPagehtml]
3Modules listing [httpdocbioperlorgreleasesbioperl-101]
4httpdocbioperlorgbioperl-livesbquomacr [httpdocbioperlorgbioperl-live]
5Unix help man and perldoc
12 General bioperl classesGeneral bioperl classes [httpwwwbioperlorgimagesbioperlpdf] diagram (cf Figure 61)
1
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 9: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/9.jpg)
Chapter 1 General introduction
2
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 10: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/10.jpg)
Chapter 2 Sequences
Chapter 2 Sequences
21 The BioSeqIO class
211 Format Converter
Example 21 SwissProt -gt Fasta
pseudocode
bull Construct aSwissProtformatted sequences input stream
bull Construct afastaformatted sequences output stream
bull for all sequences
bull read the sequence from input stream
bull write it to output stream
in bioperl
localbinperl -w
use strict
use BioSeqIO
my $in = BioSeqIO-gtnewFh ( -file =gt rsquoltseqssprsquo-format =gt rsquoswissrsquo )
my $out = BioSeqIO-gtnewFh ( -file =gt rsquogtseqsfastarsquo-format =gt rsquofastarsquo )
print $out $_ while lt$ingt
datasseqssp [dataseqssp]
3
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 11: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/11.jpg)
Chapter 2 Sequences
Exercise 21 BioSeqIO
Find all available formats via theBioSeqIO [httpdocbioperlorgreleasesbioperl-101BioSeqIOhtml]class (Solution A1)
Exercise 22 An universal converter
Modify Example 21 to program an universal converter (Solution A2)
22 Sequence classes
221 Introduction
Example 22 Loading a sequence from a remote server
pseudocode
bull Create a sequence object for theBACR_HALHASwissProtentry
bull Print its Accession number and description
in bioperl
use BioDBSwissProt
my $database = new BioDBSwissProtmy $seq = $database-gtget_Seq_by_id(rsquoBACR_HALHArsquo)
print Seq $seq-gtaccession_number() -- $seq-gtdesc() nn
4
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 12: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/12.jpg)
Chapter 2 Sequences
Exercise 23 Display a sequence in fasta format
Display theBACR_HALHAentry infastaformat (Solution A3)
222 Building mechanisms summary
de novo
use BioSeq
my $seq1 = BioSeq-gtnew ( -seq =gt rsquoATGAGTAGTAGTAAAGGTArsquo-id =gt rsquomy seqrsquo-desc =gt rsquothis is a new Seqrsquo)
from a file
use BioSeqIO
my $seqin = BioSeqIO-gtnew ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq3 = $seqin-gtnext_seq()
my $seqin2 = BioSeqIO-gtnewFh ( -file =gt rsquoseqfastarsquo-format =gt rsquofastarsquo)
my $seq4 = lt$seqin2gtmy $seqin3 = BioSeqIO-gtnewFh ( -file =gt rsquogolden spTAUD_ECOLI |rsquo
-format =gt rsquoswissrsquo)my $seq5 = lt$seqin3gt
by fetching an entry from a remote database
use BioDBSwissProt
my $banque = new BioDBSwissProtmy $seq6 = $banque-gtget_Seq_by_id(rsquoKPY1_ECOLIrsquo)
by fetching an entry from a bioperl indexed local database
use BioIndexSwissprotmy $inx = BioIndexSwissprot-gtnew( -filename =gt rsquosmall_swissinxrsquo)my $seq7 = $inx-gtfetch(rsquo100K_RATrsquo)
5
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 13: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/13.jpg)
Chapter 2 Sequences
data small_swissinx [datasmall_swissinx]
223 A deeper insight into the BioSeq class
Figure 21 shows the class structure of theBioSeq class TheBioAnnotation package andBioSeqFeature package are more detailed in Figure 23 and Figure 25 Figure 22 shows the relationbetween a SwissProt entry and the correspondingBioSeq components
Figure 21 BioSeq class structure
6
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 14: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/14.jpg)
Chapter 2 Sequences
7
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 15: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/15.jpg)
Chapter 2 Sequences
Figure 22 Relation of a SwissProt entry and the corresponding BioSeq object compo-nents
8
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 16: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/16.jpg)
Chapter 2 Sequences
Figure 23 BioAnnotation package structure
Example 23 Find the references to the PDB database entries
Does the protein have a known structure and if so what are its cristallographic structures
Where can you find this information within a SwissProt entry
Where can you find this information in the sequence object
9
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 17: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/17.jpg)
Chapter 2 Sequences
in bioperl
PDB structures entriesmy structures = ()foreach my $link ( $annotation-gtget_Annotations(rsquodblinkrsquo) )
if ($link-gtdatabase() eq rsquoPDBrsquo) push (structures $link-gtprimary_id())
print nPDB Structures join ( structures) n
Exercise 24 More on annotations
Find the function of the protein within comments (if available) (Solution A4)
Go to
Work on Exercise 26 before you continue
Exercise 25 Transmembran helices
Find transmembrane helices in the entry (Solution A5)
Tip
Use Figure 22 again
Use the appropriate objectrsquos documentation
Here is a summary [codesuseSeq10pl] solution of the exercises in this section
224 Bioperl rsquoSequencersquo classes structure
10
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 18: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/18.jpg)
Chapter 2 Sequences
Figure 24 Bioperl rsquoSequencersquo classes structure
23 Features and Location classes
231 Feature
11
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 19: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/19.jpg)
Chapter 2 Sequences
They are created either by parsing database entries (see BioSeqIOFTHelper [httpdocbioperlorgreleasesbioperl-10BioSeqIOFTHelperhtml] or Figure A1) or by parsing programs output For instance(see Exercise 416) a Blast HSP is actually an instance of class BioSeqFeatureSimilarityPair[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtml] which is a sub-class ofBioSeqFeatureFeaturePair [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureFeaturePairhtml]which again is a feature The same holds for Genscan predictions (Exercise 51)
Documentation on features
bull Features table official documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtml]
bull bioperl tutorial [httpbioperlorgCorebptutorialhtmlIII_7_1_Representing_sequence_an]
Features Classes
bull BioSeqFeatureI
bull BioSeqFeatureGeneric
bull BioSeqFeatureFeaturePair
bull BioSeqFeatureSimilarityPair
bull BioSeqFeatureSimilarity
bull BioSeqFeatureGene
bull BioSeqFeatureGeneExonI
bull BioSeqFeatureGeneExon
bull BioSeqFeatureGeneTranscript
bull BioSeqFeatureGeneTranscriptI
bull BioSeqFeatureGeneGeneStructureI
bull BioSeqFeatureGeneGeneStructure
12
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 20: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/20.jpg)
Chapter 2 Sequences
Figure 25 Features Classes structure
232 Code reading extracting CDS
Exercise 26 extractcds
code [lecture_codeextractcds] (download [ftpftppasteurfrpubGenSoftunixnucleic_acidextractcds] )
fetch an EMBL or Genbank entry and try this code with for instancegbAB042770 (seq2gb [dataseq2gb])
13
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 21: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/21.jpg)
Chapter 2 Sequences
Web form [httpbiowebpasteurfrseqanalinterfacesextractcdshtml]
Tip
Have a look at the BioSeqFeatureI [httpdocbioperlorgreleasesbioperl-101BioSeqFeatureIhtml] documentation Figure 25 shows the architecture of theBioSeqFeature class
Tip
Figure 26 shows the relation of features in an Embl entry and theBioSeqFeature class
Exercise 27 extractcds version 2
compare Exercise 26 with this version [lecture_codeextractcds2]
233 Tag system
This system enables you to refer to feature qualifiers (see documentation [httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmlqualifiers])
In the following exampleCDS(feature key) is a primary tagcodon a secondary tag
CDS 11000codon=(seqcugaaSer)codon=(seqtgaaaTrp)
The tag system also enables to create new qualifier types Examples in this tutorial
bull Blast hits as features (see Figure 41 and Exercise 416)
bull Creation of a database from a Genscan parsing (Exercise 51)
See methods related to tags in BioSeqFeatureGeneric [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtml] documentation
14
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 22: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/22.jpg)
Chapter 2 Sequences
Figure 26 Correspondance between an EMBL entry and bioperl tags
234 Location
Locations are useful for everything that looks like a range within a sequence (cfBioRangeI ) sequencesassociated to features sequences belonging to an alignment
Coordinates types There are several coordinates types and representations you can usefuzzy locations ranges lists to specify the begin or the end (see the documentation
15
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 23: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/23.jpg)
Chapter 2 Sequences
[httpwwwebiacukemblDocumentationFT_definitionsfeature_tablehtmllocation] ) example takenfrom BioLocationFuzzy [httpdocbioperlorgreleasesbioperl-10BioLocationFuzzyhtml]
my $fuzzylocation = new BioLocationFuzzy(-start =gt rsquolt30rsquo-end =gt 90-loc_type =gt rsquorsquo)
which specifies a location whose begin is before 30 and whose end is at 90 Coding
bull rsquorsquo EXACT
bull rsquo^rsquo BETWEEN (a site between 2 bases) exemple 123^124 a site between bases 123 and 124145^177a site between 2 adjacent bases somewhere between bases 145 and 177
bull rsquorsquo WITHIN (a base within a range) example (102110) is a site wihtin the 102 110 range inclusive
bull rsquoltrsquo BEFORE
bull rsquogtrsquo AFTER
Location Classes
bull BioLocationI
bull BioLocationSimple
bull BioLocationSplitLocationIet BioLocationSplit
bull BioLocationFuzzyLocationIet BioLocationFuzzy
bull Coordinate policies for fuzzy location start and end determination
bull interface BioLocationCoordinatePolicyI
bull default BioLocationWidestCoordPolicy
bull BioLocationNarrowestCoordPolicy BioLocationAvWithinCoordPolicy
16
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 24: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/24.jpg)
Chapter 2 Sequences
Figure 27 Location Classes Structure
235 Graphical view of features
The version 10 of bioperl has a package to generate annotation images of sequences The general procedure is tocreate aBioGraphicsPanel object that holds the image and the sequence to annotate Then you can addannotations line per line by using theadd_track method of the panel object TheBioGraphicsGlyphpackage contains several prdefinded styles to show annotations but you also have the possibility to write your ownglyphs
17
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 25: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/25.jpg)
Chapter 2 Sequences
Figure 28 shows some features of the SwissProt entryBACR_HALHA Here [codesfeature_graphpl] is the thatgenerates the image The helix image and the arrow of domains are added glyph types that can be found here[modules]
Figure 28 Graphical view of some features of the SwissProt entry BACR_HALHA
24 Sequence analysis toolsBioSeq [httpdocbioperlorgreleasesbioperl-101BioSeqhtml] ( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_1_Manipulating_sequence_da])
$seqobj-gtseq
sequence as string$seqobj-gtsubseq(1040) sub-sequence as string$seqobj-gttrunc(10100) sub-sequence as fresh BioPrimarySeq
18
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 26: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/26.jpg)
Chapter 2 Sequences
$seqobj-gttranslate
BioToolsSeqStats [httpdocbioperlorgreleasesbioperl-101BioToolsSeqStatshtml] ( tuto-rial [httpbioperlorgCorePODbptutorialhtmlIII_3_2_Obtaining_basic_sequence])
$seq_stats = BioToolsSeqStats-gtnew(-seq=gt$seqobj)$seq_stats-gtcount_monomers()$seq_stats-gtcount_codons()$weight = $seq_stats-gtget_mol_wt($seqobj)
et BioToolsSeqWords [httpdocbioperlorgreleasesbioperl-101BioToolsSeqWordshtml]
$seq_word = BioToolsSeqWords-gtnew(-seq =gt $seqobj)$seq_stats-gtcount_words($word_length)
BioToolsSeqPattern [httpdocbioperlorgreleasesbioperl-101BioToolsSeqPatternhtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_5_Miscellaneous_sequence_u])
$pat = rsquoT[GA]AATAATrsquo$pattern = new BioToolsSeqPattern(-SEQ =gt$pat -TYPE =gtrsquoDnarsquo)$pattern-gtexpand$pattern-gtrevcom$pattern-gtalphabet_ok
BioSeqUtils [httpdocbioperlorgreleasesbioperl-101BioToolsSeqUtilshtml]
$util = new BioSeqUtils$polypeptide_3char = $util-gtseq3($seqobj)
or
$polypeptide_3char = BioSeqUtils-gtseq3($seqobj)
BioToolsRestrictionEnzyme [httpdocbioperlorgreleasesbioperl-101BioToolsRestrictionEnzymehtml]( tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_3_Identifying_restriction_])
$re1 = new BioToolsRestrictionEnzyme(-NAME =gtrsquoEcoRIrsquo)fragments = $re1-gtcut_seq($seqobj)$re2 = new BioToolsRestrictionEnzyme( -NAME =gtrsquoEcoRV--GAT^ATCrsquo creates a new one
19
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 27: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/27.jpg)
Chapter 2 Sequences
-MAKE =gtrsquocustomrsquo)
BioToolsSigcleave [httpdocbioperlorgreleasesbioperl-101BioToolsSigcleavehtml] (tutorial [httpbioperlorgCorePODbptutorialhtmlIII_3_4_Identifying_amino_acid_c])
$sigcleave_object = new BioToolsSigcleave (rsquo-filersquo=gtrsquosigtestaarsquo not a BioSeq object rsquo-thresholdrsquo=gtrsquo35rsquo
rsquo-descrsquo=gtrsquotest sigcleave protein seqrsquorsquo-typersquo=gtrsquoAMINO rsquo)
raw_results = $sigcleave_object-gtsignals$formatted_output = $sigcleave_object-gtpretty_print
20
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 28: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/28.jpg)
Chapter 3 Alignments
Chapter 3 Alignments
31 AlignIOThe AlignIO class is designed the same as theSeqIO class Therefore format conversions can be done asfollowing
Example 31 Format conversions with AlignIO
(data [datainfilealn])
localbinperl -w
use strictuse BioAlignIO
my $inform = shift ARGV || rsquoclustalwrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioAlignIO-gtnewFh ( -fh =gt STDIN -format =gt $inform )my $out = BioAlignIO-gtnewFh ( -fh =gt STDOUT -format =gt $outform )
print $out $_ while lt$ingt
21
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 29: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/29.jpg)
Chapter 3 Alignments
AlignIO class structureWarning some alignment formats are available only as input formats
Figure 31 AlignIO Classes diagram
32 SimpleAlignAt first letrsquos see some basic methods of theSimpleAlign class
Example 32 Basic methods of SimpleAlign
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )my $aln = $in-gtnext_aln()
print same length of all sequences ($aln-gtis_flush()) yes no n
print alignment lenght $aln-gtlength() nprintf identity 2f n $aln-gtpercentage_identity()printf identity of conserved columns 2f n $aln-gtoverall_percentage_identity()
22
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 30: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/30.jpg)
Chapter 3 Alignments
23
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 31: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/31.jpg)
Chapter 3 Alignments
SimpleAlign class diagram It contains theAlignI interface that declares all methods to be implemented byAlignment classes The diagram integrates also some classes that can createSimpleAlign objects
Figure 32 Align Classes diagram
The SimpleAlign class contains methods to select sequences or columns but it can not filter alignments byfunctions as could be done by theUnivAln class of the old bioperl release In order to filter columns by
24
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 32: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/32.jpg)
Chapter 3 Alignments
properties you have to extract the columns by yourself filter them and reconstruct the new sequences Thefollowing example filters gap columns
Example 33 Filter gap columns
localbinperl -w
use strictuse BioAlignIO
my $in = new BioAlignIO ( -file =ampgt $ARGV[0] -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =ampgt STDOUT -format =gt rsquoclustalwrsquo )
my $aln = $in-gtnext_aln()
if you need to working on the columns create a list containing all columns as strings
my aln_cols = ()
foreach my $seq ( $aln-gteach_alphabetically() ) my $colnr = 0foreach my $chr ( split( $seq-gtseq()) )
$aln_cols[$colnr] = $chr$colnr++
then do the work we want to eliminate all the columns containing gaps 1 we create a list containing all the columns without any gap
my $gapchar = $aln-ampgtgap_char()my no_gap_cols = ()foreach my $col ( aln_cols )
next if $col =~ Q$gapcharEpush no_gap_cols $col
2 we modify the sequences in the alignment 2a we reconstruct the sequence strings
my seq_strs = ()foreach my $col ( no_gap_cols )
my $colnr = 0foreach my $chr ( split $col )
$seq_strs[$colnr] = $chr$colnr++
2b we replace the old sequences strings with the new ones
25
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 33: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/33.jpg)
Chapter 3 Alignments
foreach my $seq ( $aln-gteach_alphabetically() ) $seq-gtseq(shift seq_strs)
print $out $aln
Exercise 31 Create an alignment without gaps
Create a new alignment without gaps at the begining and the end of the alignment
bull sample alignment file [datainfilealn]
bull Hint you donrsquot have to extract all the columns as in the second example
Solution A6
33 Code reading protal2dnaThis script aligns DNA sequences given their protein alignment
bull code [lecture_codeprotal2dnahtml] (download)
bull bioperl classes
bull BioAlignIO [httpdocbioperlorgreleasesbioperl-10BioAlignIOhtml]
bull BioSimpleAlign [httpdocbioperlorgreleasesbioperl-10BioSimpleAlignhtml]
bull BioToolsCodonTable [httpdocbioperlorgreleasesbioperl-10BioToolsCodonTablehtml]
bull Web form [httpbiowebseqanalinterfacesprotal2dnahtml]
26
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 34: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/34.jpg)
Chapter 4 Analysis
Chapter 4 Analysis
41 Blast
411 Running Blast
4111 Running local Blast with BioToolsRunStandAloneBlast
Example 41 StandAloneBlast run
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0] -format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
You can try it with this file [data1protfasta]
Exercise 41 Running Blast on a Swissprot entry
Run the above example with the TAUD_ECOLI entry from Swissprot (see Section 222) )Solution A7
Exercise 42 Running Blast Setting parameters
Change the Expectation value (E) Blast parameter to 10 (instead of default value 100) (query [data1protfasta])Solution A8
Exercise 43 Running Blast Saving output
Save the blast report in a local file (egblastout )Solution A9
27
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 35: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/35.jpg)
Chapter 4 Analysis
Exercise 44 Running a Remote Blast
Find out from the bioperl Web site documentation how to run a Blast at NCBISolution A10
412 Parsing Blast
4121 Parsing from a Blast run result
pseudocode
bull Run Blast
bull Create a report
bull For all hits (class BioSearchHitGenericHit [httpdocbioperlorgreleasesbioperl-10BioSearchHitGenericHithtml])
bull print its name and database
bull for all HSP (class BioSearchHSPGenericHSP [httpdocbioperlorgreleasesbioperl-10BioSearchHSPGenericHSPhtml])
bull print some hit statistics
28
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 36: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/36.jpg)
Chapter 4 Analysis
Example 42 StandAloneBlast parsing
use BioSeqIOuse BioToolsRunStandAloneBlastmy $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)my $query = $Seq_in-gtnext_seq()my $factory = BioToolsRunStandAloneBlast-gtnew( rsquoprogramrsquo =gt rsquoblastprsquo
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast )
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() n
while( my $hsp = $hit-gtnext_hsp()) print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
$result belongs to the BioSearchResultGenericResult class where you can find the documentation aboutavailable methods on results
Exercise 45 Display Blast hits
Display hit accession and hsp lengthSolution A11
Exercise 46 Class of a Blast report
What is the class of$blast_report (hint use theref() perl function)Solution A12
4122 Parsing from a file
The$blast_report that is created by the BioToolsRunStandAloneBlast new method actually belongs totheBioSearchIO class So you can directly create such an object from a file
Example 43 Parsing from a Blast file
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
29
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 37: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/37.jpg)
Chapter 4 Analysis
my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Exercise 47 Parse a Blast output file
Run Example 43 with the file you have just created in Exercise 43
Exercise 48 Parse Blast results on standard input
Run Example 43 with the blast report given on standard inputSolution A13
4123 Blast Classes diagram
30
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 38: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/38.jpg)
Chapter 4 Analysis
Figure 41 Blast Classes diagram
4124 Filtering Blast output
Exercise 49 Filtering hits by length
Only display hits of length gt to a given value (create a blast report from this file [datablastout])Solution A14
31
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 39: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/39.jpg)
Chapter 4 Analysis
Exercise 410 Filtering hits by position
Only display hits between two positionsSolution A15
Exercise 411 Display the best hit by databank
Display the best hit for each database in a Blast NCBI report (example [datancbi-blastout])
bull If necessary you can find here adbname(hit-gtname) [solutiondbnamepl] function to extract databasename from the hit name
bull You may use a perl hash to mark a database as already done
$done$database = 1
Solution A16
Exercise 412 Multiple queries
Submit two sequences and display the best hit for each (2nd query [dataP42704fasta])Solution A17
4125 Extracting data from a Blast report
Exercise 413 Extracting the subject sequence
Extract as a string the subject sequence from HSP with identity above a given levelSolution A18
Exercise 414 Extracting alignments
Extract HSP alignment from hits which name contains a given pattern and print this alignment in Clustalw formatSolution A19
Exercise 415 Locate EST in a genome
Locate EST in a genome
bull displays the occurrences of this EST [dataest] in this contig [datacontig] of a genome
bull You may first display this as a Clustalw alignment (on sequence by HSP)
bull Blast report [datablast-estout]
Solution A20
32
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 40: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/40.jpg)
Chapter 4 Analysis
Exercise 416 Record Blast hits as sequence features
Iterate a local Blast on databases given on the command line and record each corresponding best hit as a featurefor the query sequenceSolution A21
413 BioToolsBPlite family parsers
The old BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] family ofparsers (BPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] BioToolsBPpsilite[httpdocbioperlorgreleasesbioperl-10BioToolsBPpsilitehtml] BioToolsBPbl2seq [httpdocbioperlorgreleasesbioperl-10BioToolsBPbl2seqhtml]) is still maintained in the 10 release although is will be deprecated one day It isactually still necessary for blasting two sequences and to perform a psiblast (Position Specific Iterative Blast) Itis also still the default parser returned by BioToolsRunStandAloneBlast as shown below (although this mightchange soon)
Example 44 Parsing with BPLite
In the following example (notice that we have removed the_READMETHODparameter) the$blast_report does not belong to the BioSearchIO [httpdocbioperlorgreleasesbioperl-10BioSearchIOhtml] class but to the BioToolsBPlite [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtml] class That is why the method to use in order to get informations are differentnamely nextSbjct [httpdocbioperlorgreleasesbioperl-10BioToolsBPlitehtmlPOD8] and nextHSP[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtmlPOD4] of class BioToolsBPliteSbjct[httpdocbioperlorgreleasesbioperl-10BioToolsBPliteSbjcthtml] instead ofnext_hit andnext_hsp There is nonext_result method since these parsers were not able to deal with multiple reports
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
33
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 41: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/41.jpg)
Chapter 4 Analysis
Exercise 417 Print informations from a BPLite report
Take Example 44 and print hit names and for all HSP its P value percent identity and score (seeBioToolsBPliteHSP)Solution A22
Exercise 418 Create a BioToolsBPlite from a file
Create a BioToolsBPlite from a fileSolution A23
Exercise 419 Create a BioToolsBPlite from standard input
Create a BioToolsBPlite from standard inputSolution A24
34
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 42: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/42.jpg)
Chapter 4 Analysis
35
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 43: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/43.jpg)
Chapter 4 Analysis
Figure 42 BPLite Classes diagram
36
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 44: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/44.jpg)
Chapter 4 Analysis
414 PSI-BLAST (Position Specific Iterative Blast)
See documentation [httpwwwncbinlmnihgovBLASTtutorialAltschul-2html] or tutorial [httpwwwncbinlmnihgovEducationBLASTinfopsi1html]at NCBI
Example 45 Running PSI-blast
The following example run a PSI-blast
use BioSeqIOuse BioToolsRunStandAloneBlast
my $query_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query = lt$query_ingt$factory = BioToolsRunStandAloneBlast-gtnew(rsquodatabasersquo =gt rsquoswissprotrsquo)$factory-gtj(2)$blast_report = $factory-gtblastpgp($query)
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print iteration $iterationnmy $result = $blast_report-gtround($iteration)
while( my $hit = $result-gtnextSbjct()) print thit name $hit-gtname() nforeach my $hsp ($hit-gtnextHSP)
print tstart $hsp-gthit-gtstart() end $hsp-gthit-gtend()nprint tscore $hsp-gtscore() n
You can also load a report from a file or from standard input
my $blast_report = BioToolsBPpsilite-gtnew(-fh=gtFH)
Example 46 PSI-Blast SearchIO class
There is also aSearchIO class for PSI-blast report but it does not give access to the specific iterations
use BioSearchIO
my $in = BioSearchIO-gtnew( -format =gt rsquopsiblastrsquo-file =gt $ARGV[0] )
while ( my $result = $in-gtnext_result() ) foreach my $hit ( $result-gthits )
37
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 45: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/45.jpg)
Chapter 4 Analysis
print Hit $hitn
Exercise 420 Parse a PSI-blast report
Load a PSI-blast report (example [datapsiblastout] ) and at each iteration display
bull the hits that were not present at the previous iteration
bull the hits that have disappeared
Solution A25
415 bl2seq Blast 2 sequences
Example 47 Running bl2seq
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id
38
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 46: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/46.jpg)
Chapter 4 Analysis
-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
Exercise 421 Build a BioSimpleAlign object
Build a BioSimpleAlign object from the subject and query sequences and display this alignment inClustalw formatSolution A26
416 Blast Internal classes structure
39
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 47: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/47.jpg)
Chapter 4 Analysis
The following diagram shows the internal classes structure It is provided for information and it has not to beunderstood to use bioperl Blast classes
Figure 43 Blast internal classes diagram
40
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 48: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/48.jpg)
Chapter 4 Analysis
42 Genscan
Example 48 Genscan parsing
The folowing example shows how to use the BioToolsGenscan [httpdocbioperlorgreleasesbioperl-10BioToolsGenscanhtml] parser (data Genscan output file [datagenscanout])
use BioToolsGenscan
my $genscan_file = $ARGV[0]$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
while(my $gene = $genscan-gtnext_prediction()) my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
Example 49 Genscan parsing with sub-sequences
(data Genscan output with CDS [datagenscan-cdsout])
Genscan example with sub-sequences
use BioToolsGenscan
my $genscan_file = $ARGV[0]
$genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)while(my $gene = $genscan-gtnext_prediction())
my $prot = $gene-gtpredicted_proteinprint protein ( ref($prot) ) n $prot-gtseq nn
display genscan predicted cds (if -cds genscan option)my $predicted_cds = $gene-gtpredicted_cdsprint predicted CDS n $predicted_cds-gtseq nn
foreach my $exon ($gene-gtexons()) my $loc = $exon-gtlocationprint exon - primary_tag $exon-gtprimary_tag coding $exon-gtis_coding() start-end $loc-gtstart - $loc-gtend n
foreach my $intron ($gene-gtintrons()) my $loc = $intron-gtlocation
41
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 49: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/49.jpg)
Chapter 4 Analysis
print intron primary_tag $intron-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
foreach my $utr ($gene-gtutrs()) my $loc = $utr-gtlocationprint utr primary_tag $utr-gtprimary_tag start-end $loc-gtstart - $loc-gtend n
print ---------------------------------n
42
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 50: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/50.jpg)
Chapter 4 Analysis
Figure 44 Genscan Classes Structure
43
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 51: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/51.jpg)
Chapter 4 Analysis
Exercise 422 Code reading BioToolsGenscan module
1What is the purpose of theBioSeqAnalysisParserI class
2Look at the_parse_predictions method inBioToolsGenscan
3What happens during the creation of aBioToolsGenscan instance which methods are called atwhich class hierarchy level
44
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 52: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/52.jpg)
Chapter 5 Databases
Chapter 5 Databases
51 Database classes
Example 51 Database class use
use BioIndexSwissprotuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT -format =gt rsquofastarsquo)my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)print $out $seq
Example 52 Database Index creation
(data a SwissProt file to index [datasmall_swissdat] )
use BioIndexSwissprot
my $Index_File_Name = shiftmy $inx = BioIndexSwissprot-gtnew( -filename =gt $Index_File_Name
-write_flag =gt 1)$inx-gtmake_index(ARGV)
45
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 53: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/53.jpg)
Chapter 5 Databases
Figure 51 Database Classes structure
Exercise 51 Parse a Genscan report and build a database entry
Parse a Genscan report on a part of Arabidopsis chromosome V (data Genscan report [datagenscan-arabout])
bull (a) construct a database in Embl format
46
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 54: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/54.jpg)
Chapter 5 Databases
bull
bull containing an entry for each Genscan predicted CDS
bull put this CDS exons as Features in the current entry
bull (b) index this Embl formatted database
bull (c) use your indexed database
Outline of the exercise
Solution A27
47
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 55: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/55.jpg)
Chapter 5 Databases
Exercise 52 Parse a Genscan report and build a database entry with the genomicsequence
Starting from exercise Exercise 51 add the DNA genomic sequence as sequence of the entry (+- delta at thebeginning and at the end) Example
SQ Sequence 2448 BP 897 A 487 C 411 G 653 T 0 otherttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat 60gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct 120
(data Arabidopsis sequence - part of the chromosome V [dataarabidop-chr5-1quartfasta])
bull build an additional feature describing all the CDS
bull put the translation as a tag for this CDS (see add_tag_value [httpdocbioperlorgreleasesbioperl-10BioSeqFeatureGenerichtmlPOD14]) which yields for instance
FT CDS join(complement(100171)complement(284358)FT complement(457492)complement(626673)FT complement(14601511)complement(24732513)FT complement(26382716)complement(28132969)FT complement(30463178)complement(32633390)FT complement(34763635))FT translation =MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLLFT TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSEFT DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSERFT NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLLFT DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNFFT STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN
bull regarding the location of the feature see Section 234 in this tutorial as well as add_sub_Location[httpdocbioperlorgreleasesbioperl-10BioLocationSplithtmlPOD2]
Solution A28
52 Accessing a local database with golden
Exercise 53 Build a small bioperl module (for the golden program)
Idea we want a bioperl module for local databases access We could use bioperl indexes except that this is notefficient for large databases (indexing time and indexes size) All these databases are already indexed by golden[ftpftppasteurfrpubGenSoftunixdb_softgolden] in a very efficient way The module could be designed to beused like this
48
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 56: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/56.jpg)
Chapter 5 Databases
use LocalDBAccessuse BioSeqIO
my $seq = LocalDBAccess-gtget_Seq_by_id ( $ARGV[0] )my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquofastarsquo )
print $out $seq
Solution A29
49
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 57: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/57.jpg)
Chapter 5 Databases
50
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 58: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/58.jpg)
Chapter 6 Perl Reminders
Chapter 6 Perl RemindersThese reminders may be helpful to use modules [use] or to a more advanced understanding [advanced] of them
61 UML
Figure 61 UML meanings
62 Perl reminders to use bioperl modules
621 References
bull to pass a hash or a function as a parameter to a function
$blastObj = BioToolsBlast-gtnew( -run =gt runParam-parse =gt 1
-filt_func =gt ampmy_filter)
bull for an output parameter
blastParam = (-run =gt runParam-save_array =gt blast_objs )
51
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 59: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/59.jpg)
Chapter 6 Perl Reminders
bull to pass a filehandle as a parameter
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
bull references on an anonymous array
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt [ $seq ]) BioSeqpm objects
which is equivalent to
runParam = ( -method =gt rsquoremotersquo-prog =gt rsquoblastprsquo-database =gt rsquoswissprotrsquo-seqs =gt seqs) BioSeqpm objects
with a variable
bull references on anonymous data structures
(reminder) anonymous hasha = (a =gt 1 b =gt 2 )print hash ref(a) $aa $ab n
reference on an anonymous hash$ra = a =gt 1 b =gt 2 print ref hash ref($ra) $ra-gta $ra-gtb n
reference on an anonymous array$rl = [1 2]print ref array ref($rl) $rl-gt[0] $rl-gt[1] n
See for instance in exercises 24 et 27ref() returns the type of the variable (or its classsee below)
622 Filehandles and streams
bull A data stream is for instance an open file Streams are available through filehandles (file descriptors) - inperl a symbol without a$ such asINFILE or OUTFILE Standard input and standard output streams areassociated by default to theSTDIN andSTDOUTdescriptors The filehandle is associated to the stream whenopening the file
52
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 60: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/60.jpg)
Chapter 6 Perl Reminders
open (INFILE $infile) || die cannot open $infile$open (OUTFILE gt $outfile) || die cannot open $outfile$$line = ltINFILEgt
read a sequence objectprint OUTFILE $sequence write a sequence object
bull In bioperl there are classes that build filehandles
$in = BioSeqIO-gtnewFh ( -file =gt rsquofrsquo) make a tied filehandle$sequence = lt$ingt read a sequence object$out = BioSeqIO-gtnewFh ( -file =gt rsquogt frsquo)print $fh $sequence write a sequence object
bull Filhandles such asINFILE or STDOUTbelong to a specific namespace and can not be used in an assignationor as a parameter For this you have to pass a reference to the typeglob
my $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquoemblrsquo)
typeglobs are in a general namespace on top of the other namespaces for scalars arrays hashes and functionsYou can use it to create aliases (see the Advanced Perl Programming book)
623 Exceptions
bull When using a bioperl module the following can happen
-------------------- EXCEPTION --------------------MSG Attempting to set the sequence to [==] which does not look healthySTACK BioPrimarySeqseq locallibperl5site_perl560BioPrimarySeqpm243STACK BioPrimarySeqnew locallibperl5site_perl560BioPrimarySeqpm218STACK BioSeqnew locallibperl5site_perl560BioSeqpm132STACK BioSeqIOfastanext_primary_seq locallibperl5site_perl560BioSeqIOfastapm130
STACK BioSeqIOfastanext_seq locallibperl5site_perl560BioSeqIOfastapm85
STACK toplevel solutionblast_featurespl22-------------------------------------------
bioperl indeed use perl exception mechanisms (die et eval ) to handle use errors An exception is raised bythe perl module an can be caught
53
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 61: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/61.jpg)
Chapter 6 Perl Reminders
bull throw in bioperl this is the method for raising exceptions It is provided by the BioRootRootI[httpdocbioperlorgreleasesbioperl-10BioRootRootIhtml] class
sub throwmy ($self$string) = _my $std = $self-gtstack_trace_dump()my $out = -------------------- EXCEPTION --------------------nMSG $stringn$std-------------------------------------------n
die $out
bull You can catch exceptions witheval (donrsquot forget the at the end of theeval block
eval this instruction may throw an exception if parameters are incorrect$report = $factory-gtbl2seq($querySeq $subjectSeq )
if( $ )
print Caught exceptionelse
print no exception
bull In bioperl exceptions are also used to implement interfaces classes the following example inBioPrimarySeqI [httpdocbioperlorgreleasesbioperl-10BioPrimarySeqIhtml]
sub seq my ($self) = _if( $self-gtcan(rsquothrowrsquo) )
$self-gtthrow(BioPrimarySeqI definition of seq - implementing class did not provide this method)
else
confess(BioPrimarySeqI definition of seq - implementing class did not provide this method)
is a mean to oblige the subclass to implement aseq method It is thus a way to encode abstract methods
54
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 62: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/62.jpg)
Chapter 6 Perl Reminders
624 GetoptStd
bull To handle command line optionsman GetoptStd
bull Example
use GetoptStdmy opts = ()getopts(rsquoioIOrsquo opts)my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
bull Another example
use GetoptStdgetopts(rsquofplghrsquo)if ($opt_f) $format = $opt_felse $format = Genbank
625 Classes
bull Inheritance example in BioSeq [httpdocbioperlorgreleasesbioperl-10BioSeqhtml]
ISA = qw(BioRootRootI BioSeqI)
bull In bioperl you use classes the following ways
bull By instantiating a class most often with anew method (this is a coding convention - see alsonewFh [httpdocbioperlorgreleasesbioperl-10BioSeqIOhtmlnewFh]from_searchResult[httpdocbioperlorgreleasesbioperl-10BioSeqFeatureSimilarityPairhtmlfrom_searchResult] )
$seq_stats = BioToolsSeqStats-gtnew ($seqobj)
bull As a library component
55
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 63: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/63.jpg)
Chapter 6 Perl Reminders
bull Test whether a class or an objectis-a
if (BioToolsBlast-gtisa(BioRootRootI))
if ($seq-gtisa(BioSeqRichSeq))
bull find the class of an object
print class of exon ref($exon) n
626 BEGIN block
Statements to be executed when the module is loaded (eg when issuing ause Module statement) see forinstance
BioLocationI to define the default coordinate policy
BEGIN $coord_policy = BioLocationWidestCoordPolicy-gtnew()
Remark this is necessary in object-oriented style since everything is encapsulated in methods
63 Perl reminders for a further advanced understanding ofbioperl modules
631 Modules
bull Exporter [httpwwwperldoccomperl56libExporterhtml] handles module variables scope (even thoughthere are no private variables in perl)
bull EXPORT= qw() symbols exported by default
bull EXPORT_OK = qw() symbols imported on demand
56
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 64: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/64.jpg)
Chapter 6 Perl Reminders
bull EXPORT_TAGS = tag =gt [] tags for symbols sets example
use BioToolsBlast qw(obj)
imports symbols tagged asobj - here$Blast a staticBioToolsBlast object for a restrciteduse of the module methods
632 Compiler instructions
bull use strict no symbolic references no access to non local variables (local variable are declared by amystatement) no barewords
bull use vars qw(ISA valid_type) a pragmato pre-declare global variables
633 Tie
Associates a class to a variable whenever a variable is tiedread and write statements (access assignation) trigger a call topredefined subroutines (FETCH STORE PRINT READLINE) You can find an example in(BioSeqIO)
sub fh my $self = shift my $class = ref($self) || $selfmy $s = Symbolgensymtie $$s$class$self return $s
which returns a filehandle tied to the BioSeqIOFhclassAs explained in Tying-FileHandles BioSeqIO classhas thus to redefine these methods
sub READLINE my $self = shiftreturn $self-gtrsquoseqiorsquo-gtnext_seq() unless wantarraymy (list $obj)push list $obj while $obj = $self-gtrsquoseqiorsquo-gtnext_seq()return list
sub PRINT my $self = shift $self-gtrsquoseqiorsquo-gtwrite_seq(_)
These methods enable the programmer to read and print through the variable
57
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 65: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/65.jpg)
Chapter 6 Perl Reminders
$fh = $obj-gtfh make a tied filehandle$sequence = lt$fhgt read a sequence object (calls READLINE)print $fh $sequence write a sequence object (calls PRINT)
58
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 66: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/66.jpg)
Appendix A Solutions
Appendix A Solutions
A1 Sequences
Solution A1 BioSeqIO structure
Exercise 21
59
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 67: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/67.jpg)
Appendix A Solutions
Figure A1 BioSeqIO structure
Solution A2 An universal converter
Exercise 22
UnivConvert via a file
60
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 68: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/68.jpg)
Appendix A Solutions
localbinperl -w
exercise 11 UnivConvert via file
use strict
use BioSeqIO
my $inform = shift ARGVmy $outform = shift ARGV
my $infile = shift ARGVmy $outfile = shift ARGVif ( defined $outfile )
$outfile = $infile$outfile =~ s$ if $outfile =~ $outfile = $outform
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl UnivConvert2pl swiss gcg seqssp
UnivConvert via stream
localbinperl -w
exercise 11 UnivConvert via flux
use strict
use BioSeqIO
my $inform = shift ARGV || rsquoswissrsquomy $outform = shift ARGV || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT
61
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 69: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/69.jpg)
Appendix A Solutions
-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
cat seqssp | perl solutionUnivConvertStdminpl swiss gcg
UnivConvert with options
localbinperl -w
exercice 11 UnivConvert avec options
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquoioIOrsquo opts)
my $infile = $optsrsquoIrsquo || rsquoinfilersquomy $outfile = $optsrsquoOrsquo || rsquooutfilersquomy $inform = $optsrsquoirsquo || rsquoswissrsquomy $outform = $optsrsquoorsquo || rsquofastarsquo
my $in = BioSeqIO-gtnewFh ( -file =gt $infile-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -file =gt gt$outfile-format =gt $outform )
print $out $_ while lt$ingt
You can use it like this
perl solutionUnivConvertpl -I seqssp -O seqsfasta
UnivConvert with options and stream
localbinperl -w
62
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 70: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/70.jpg)
Appendix A Solutions
exercise 11 UnivConvert with option and stream
use strictuse GetoptStd
use BioSeqIO
my opts = ()getopts(rsquohiorsquo opts)
usage() if (exists $optsrsquohrsquo)
my $inform = lc($optsrsquoirsquo) || rsquoswissrsquomy $outform = lc($optsrsquoorsquo) || rsquofastarsquo
(format_ok($inform) ampamp format_ok($outform)) || exit 1
my $in = BioSeqIO-gtnewFh ( -fh =gt STDIN-format =gt $inform )
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt $outform )
print $out $_ while lt$ingt
sub format_ok
my $form = shift _
my formats = qw(fasta 1 swiss 1 embl 1 genbank 1scf 1 pir 1 gcg 1 raw 1 ace 1 bsml 1 fastq 1phd 1 qual 1)
if ( exists $formats$form) print STDERR $form -- unknown formatnusage ()return 0 false
return 1 true
sub usage
my $prog =lsquobasename $0lsquo chomp($prog)
print nprint $prog convert files from one format into another formatnprint n
63
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 71: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/71.jpg)
Appendix A Solutions
print $prog [-i informat] [-o outformat] lt infilenprint nprint -i informat -- infile format (default rsquoswissrsquo)nprint -o outformat -- outfile format (default rsquofastarsquo)nprint nprint -h -- print this messagenprint nprint Available formats fasta swiss EMBL GenBank SCF Pir GCG raw et ACEn
print n
exit 1
You can use it like this
cat seqssp | perl solutionUnivConvertStdpl -o gcg gt seqsgcg
Solution A3 Display a sequence in fasta format
Exercise 23
rest de lrsquoannotationmy $annotation ampequals $seq-ampgtannotation()
my structures ampequals ()foreach my $link ( $annotation-ampgteach_DBLink() )
if ($link-ampgtdatabase() eq rsquoPDBrsquo) push (structures $link-ampgtprimary_id())
Solution A4 More on annotations
Exercise 24
get the protein function -- exomy $function = foreach my $comment ( $annotation-gtget_Annotations(rsquocommentrsquo) )
my lines = split (n $comment-gttext())foreach my $line (lines)
if ( $line =~ s^-- FUNCTION ) $line =~ sn g$function = $line
64
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 72: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/72.jpg)
Appendix A Solutions
last
last if $function ne
print nFunction $function n
Solution A5 Transmembran helices
Exercise 25
almost for GenBankEMBL entriesmy features = $seq-gttop_SeqFeatures()
foreach my $feat (features) if ($feat-gtprimary_tag() eq rsquoTRANSMEMrsquo)
print TM Segment from $feat-gtstart() to $feat-gtend() n
A2 Alignments
Solution A6 Create an alignment without gaps
Exercise 31
localbinperl -w
Create a new alignment without gaps at the begining and the end of an alignment
use strictuse BioAlignIO
get alignment
my $inif ( $ARGV gt -1 )
$in = new BioAlignIO ( -file =gt $ARGV[0] -format =gt rsquoclustalwrsquo )else
$in = new BioAlignIO ( -fh =gt STDIN -format =gt rsquoclustalwrsquo )my $out = newFh BioAlignIO ( -fh =gt STDOUT -format =gt rsquoclustalwrsquo )
65
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 73: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/73.jpg)
Appendix A Solutions
my $aln = $in-gtnext_aln()
find max column index of starting gaps and min column index of ending gaps
my $startgaps = 0my $endgaps = 0my $gap_char = $aln-gtgap_char()foreach my $seq ($aln-gteach_seq)
my $str = $seq-gtseq()$str =~^(Q$gap_charE)my $len = length($1)$startgaps = ($startgaps gt $len) $startgaps $len$str =~(-)$$len = length($1)$endgaps = ($endgaps gt $len) $endgaps $len
cut the starting and ending block containing gaps
print $out $aln-gtslice($startgaps+1 $aln-gtlength()-$endgaps)
A3 Analysis
Solution A7 Running Blast on a Swissprot entry
Exercise 41
use BioSeqIOuse BioToolsRunStandAloneBlast
a) solution if you have a local program to fetch database entries (ftpftppasteurfrpubGenSoftunixdb_softgolden)
my $Seq_in = BioSeqIO-gtnew (-file =gt rsquogolden spTAUD_ECOLI |rsquo -format =gt rsquoswissrsquo)my $query = $Seq_in-gtnext_seq()
b) else
use BioDBSwissProtmy $database = new BioDBSwissProtmy $query = $database-gtget_Seq_by_id(rsquoTAUD_ECOLIrsquo)
my $factory = BioToolsRunStandAloneBlast-gtnew(
66
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 74: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/74.jpg)
Appendix A Solutions
rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A8 Running Blast Setting parameters
Exercise 42
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
$factory-gte(10)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A9 Running Blast Saving output
Exercise 43
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo
67
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 75: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/75.jpg)
Appendix A Solutions
rsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast
)$factory-gtoutfile(rsquoblastoutrsquo)my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() significance $hit-gtsignificance() n
Solution A10 Running a remote Blast
Exercise 44
use BioSeqIOuse BioToolsRunRemoteBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunRemoteBlast-gtnew(rsquo-progrsquo =gt rsquoblastprsquorsquo-datarsquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtsubmit_blast($query)
while ( my rids = $factory-gteach_rid )
print STDERR waitingn
RID = Remote Blast ID (eg 1017772174-16400-6638)foreach my $rid ( rids )
my $rc = $factory-gtretrieve_blast($rid)if( ref($rc) )
if( $rc lt 0 ) retrieve_blast returns -1 on error$factory-gtremove_rid($rid)
retrieve_blast returns 0 on rsquojob not finishedrsquosleep 5
else
---- Blast done ----$factory-gtremove_rid($rid)
68
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 76: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/76.jpg)
Appendix A Solutions
my $result = $rc-gtnext_resultprint database $result-gtdatabase_name() nwhile( my $hit = $result-gtnext_hit )
print hit name is $hit-gtname nwhile( my $hsp = $hit-gtnext_hsp )
print score is $hsp-gtscore n
Solution A11 Display Blast hits
Exercise 45
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit accession $hit-gtaccession() nwhile( my $hsp = $hit-gtnext_hsp())
print length $hsp-gtlength() n
Solution A12 Class of a Blast report
Exercise 46
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
69
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 77: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/77.jpg)
Appendix A Solutions
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo
_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall($query)print Class of blast_report is ref($blast_report) n
Solution A13 Parse Blast results on standard input
Exercise 48
use BioSearchIOmy $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-fhrsquo =gt STDIN)
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
One way to use this code is by feeding ablastall output directly to the script through a Unix pipe
blastall -p blastp -d swissprot -i data1protfasta | perl solutionblast_parse_stdinpl
Solution A14 Filtering hits by length
Exercise 49
use BioSearchIOmy $max_length = ($ARGV[1]) $ARGV[1] 200my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
if ($hit-gtlength() gt= $max_length) print thit name $hit-gtname() hit length $hit-gtlength() nwhile( my $hsp = $hit-gtnext_hsp())
70
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 78: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/78.jpg)
Appendix A Solutions
print E $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A15 Filtering hits by position
Exercise 410
use BioSearchIO
my $min_start = ($ARGV[1]) $ARGV[1] 1my $max_end = ($ARGV[2]) $ARGV[2] -1
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtstart() gt= $min_start ampamp($max_end == -1 || ($hsp-gtend() lt= $max_end))) print ttstart $hsp-gtstart() end $hsp-gtend() nprint ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A16 Display the best hit by databank
Exercise 411
use BioSearchIO
sub dbname my ($name) = _$db= $name$db =~ (w+)|$db = $1return $db
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquo
71
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 79: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/79.jpg)
Appendix A Solutions
rsquo-filersquo =gt $ARGV[0])my $result = $blast_report-gtnext_result
while( my $hit = $result-gtnext_hit()) my $dbname = dbname($hit-gtname())if ($done$dbname)
next else
$done$dbname = 1print thit name $hit-gtname() nwhile( my $hsp = $hit-gtnext_hsp())
print ttE $hsp-gtevalue() frac_identical $hsp-gtfrac_identical() n
Solution A17 Multiple queries
Exercise 412
use BioSeqIOuse BioToolsRunStandAloneBlast
my query_seqs = ()my $Seq1_in = BioSeqIO-gtnew (-file =gt $ARGV[0]
-format =gt rsquofastarsquo)push (query_seqs $Seq1_in-gtnext_seq())
my $Seq2_in = BioSeqIO-gtnew (-file =gt $ARGV[1]-format =gt rsquofastarsquo)
push (query_seqs $Seq2_in-gtnext_seq())
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo_READMETHOD =gt Blast)
my $blast_report = $factory-gtblastall(query_seqs)while (my $result = $blast_report-gtnext_result())
while( my $hit = $result-gtnext_hit()) print thit name $hit-gtname() significance $hit-gtsignificance() nlast
72
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 80: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/80.jpg)
Appendix A Solutions
Solution A18 Extracting the subject sequence
Exercise 413
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A19 Extracting alignments
Exercise 414
use BioSearchIO
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[0])
my $result = $blast_report-gtnext_resultmy $level = $ARGV[1]
while( my $hit = $result-gtnext_hit()) while( my $hsp = $hit-gtnext_hsp())
if ($hsp-gtfrac_identical() gt= $level) print $hsp-gthit_string n
Solution A20 Locate EST in a genome
Exercise 415
use BioSeqIOuse BioSearchIOuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
73
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 81: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/81.jpg)
Appendix A Solutions
my $seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $seq = $seq_in-gtnext_seq()my $contig_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-gtseq-id =gt $seq-gtdisplay_id-start =gt 1-end =gt $seq-gtlength)
my $aln = BioSimpleAlign-gtnew()$aln-gtadd_seq($contig_seq)
my $blast_report = new BioSearchIO (rsquo-formatrsquo =gt rsquoblastrsquorsquo-filersquo =gt $ARGV[1])
my $result = $blast_report-gtnext_resultwhile( my $hit = $result-gtnext_hit())
my $i = 0while( my $hsp = $hit-gtnext_hsp())
$i++my $start = $hsp-gthit-gtstart()my $end = $hsp-gthit-gtend()my $name = $result-gtquery_name _HSP_$imy $seq = fill_gaps($hsp-gtquery_string $start $aln-gtlength)my $query_seq = BioLocatableSeq-gtnew(
-seq =gt $seq-id =gt $name-start =gt $start-end =gt $end)
$aln-gtadd_seq($query_seq)
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )print $out $aln
sub fill_gaps add gaps from 0 to start and from start + seq length to endmy ($seq $start $length) = _my $gapped_seq = for (my $i = 0 $i lt $start - 1 $i++)
$gapped_seq = -$gapped_seq = $seqfor (my $i = $start + length($seq) $i lt= $length $i++)
$gapped_seq = -return $gapped_seq
74
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 82: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/82.jpg)
Appendix A Solutions
Example of use
perl solutionblast_locatepl datacontig datablast-estout
Solution A21 Record Blast hits as sequence features
Exercise 416
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioSequse BioSeqFeatureSimilarityPair
my $seqfile = shift ARGVmy $in = BioSeqIO-gtnew (-file =gt $seqfile -format =gt rsquofastarsquo)my $query = $in-gtnext_seq()my dbs = ARGV
foreach my $db (dbs) my $factory = BioToolsRunStandAloneBlast-gtnew(database =gt $db
program =gt rsquoblastprsquo)
my $report = $factory-gtblastall($query)
while(my $sbjct = $report-gtnextSbjct) print STDERR name $sbjct-gtname nwhile (my $hsp = $sbjct-gtnextHSP)
print STDERR GFFn $hsp-gtquery()-gtgff_string() n$query-gtadd_SeqFeature($hsp)
last
foreach my $feat ($query-gtall_SeqFeatures()) print STDERR Feature from $feat-gtstart to $feat-gtend Primary tag $feat-gtprimary_tag produced by $feat-gtsource_tag() n$feat-gtprimary_tag(BLAST)
$out = BioSeqIO-gtnewFh(-fh =gt STDOUT rsquo-formatrsquo =gt rsquoswissrsquo)print $out $query
75
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 83: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/83.jpg)
Appendix A Solutions
Solution A22 Print informations from a BPLite report
Exercise 417
use BioSeqIOuse BioToolsRunStandAloneBlast
my $Seq_in = BioSeqIO-gtnew (-file =gt $ARGV[0]-format =gt rsquofastarsquo)
my $query = $Seq_in-gtnext_seq()
my $factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquorsquodatabasersquo =gt rsquoswissprotrsquo)
my $blast_report = $factory-gtblastall($query)while (my $subject = $blast_report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtpercent$hsp-gtscore) n
Solution A23 Create a BioToolsBPlite from a file
Exercise 418
use BioToolsBPlite
my $report = new BioToolsBPlite(-file =gt $ARGV[0])while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
76
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 84: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/84.jpg)
Appendix A Solutions
Solution A24 Create a BioToolsBPlite from standard input
Exercise 419
use BioToolsBPlite
my $report = new BioToolsBPlite(-fh =gt STDIN)while (my $subject = $report-gtnextSbjct())
print $subject-gtname() nwhile (my $hsp = $subject-gtnextHSP())
print join(t$hsp-gtP$hsp-gtscore$hsp-gtlength$hsp-gtmatch$hsp-gtpositive) n
Solution A25 Parse a PSI-blast report
Exercise 420
use BioToolsBPpsilite
my $blast_report = new BioToolsBPpsilite (rsquo-filersquo =gt $ARGV[0])
for my $iteration (1 $blast_report-gtnumber_of_iterations()) print n-----------------------------niteration $iterationnmy $result = $blast_report-gtround($iteration)my $hitarray_ref = $result-gtnewhitsforeach my $hit ( $hitarray_ref )
print NEW hit name $hitnmy $oldhitarray_ref = $result-gtoldhitsforeach my $hit (previous_hitarray)
if (grep Q$hitE $oldhitarray_ref ) next
print DISAPEARED hit name $hitn
push (previous_hitarray $oldhitarray_ref )push (previous_hitarray $hitarray_ref )
77
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 85: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/85.jpg)
Appendix A Solutions
Solution A26 Build a BioSimpleAlign object
Exercise 421
use BioSeqIOuse BioToolsRunStandAloneBlastuse BioAlignIOuse BioSimpleAlignuse BioLocatableSeq
my $query1_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[0]-format =gt rsquofastarsquo )
my $query1 = lt$query1_ingt
my $query2_in = BioSeqIO-gtnewFh ( -file =gt $ARGV[1]-format =gt rsquofastarsquo )
my $query2 = lt$query2_ingt
my $out = BioAlignIO-gtnewFh(-format =gt rsquoclustalwrsquo )
$factory = BioToolsRunStandAloneBlast-gtnew(rsquoprogramrsquo =gt rsquoblastprsquo)
$report = $factory-gtbl2seq($query1 $query2)
while(my $hsp = $report-gtnext_feature) my $aln = BioSimpleAlign-gtnew()my $querySeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtqs-id =gt $query1-gtdisplay_id-start =gt 1-end =gt $query1-gtlength
)my $sbjctSeq = BioLocatableSeq-gtnew(
-seq =gt $hsp-gtss-id =gt $report-gtsbjctName-start =gt 1-end =gt $query2-gtlength)
$aln-gtadd_seq($querySeq)$aln-gtadd_seq($sbjctSeq)print $out $aln
A4 Databases
78
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 86: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/86.jpg)
Appendix A Solutions
Solution A27
Exercise 51
bull (a) database construction
localbinperl -w
(a) Genscan build a database containing one entry per CDS
use strict
use BioToolsGenscanuse BioSeqIO
my $genscan_file = shift ARGV
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
date of constructionmy $date = gmtime
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquomRNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cds$cds-gtid() =~ |(+)|my $id = $1 bug$cds-gtid($id) add cds as sequence to the entry$entry-gtprimary_seq($cds)
annotation add all detected exons as SeqFeature to the entryforeach my $exon ($gene-gtexons())
$entry-gtadd_SeqFeature($exon)
79
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 87: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/87.jpg)
Appendix A Solutions
add all detected UTRrsquos as SeqFeature to the entryforeach my $utr ($gene-gtutrs())
$entry-gtadd_SeqFeature($utr)
print the entry to the database flatfile in embl formatprint $banque $entry
You can use this script like this
perl solutionbuild_dbpl datagenscan-arabout gt mydbembl
bull (b) make index
usrbinperl -w
(b) Genscan create the index
use strictuse BioIndexEMBL
my $Index_File_Name = shift
my $inx = BioIndexEMBL-gtnew( -filename =gt $Index_File_Name-write_flag =gt 1)
$inx-gtmake_index(ARGV)
You can use this script like this
perl solutionmake_indexpl mydbidx mydbembl
80
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 88: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/88.jpg)
Appendix A Solutions
bull (c) use index
usrbinperl -w
(c) Genscan retrieve some files
use strictuse BioIndexEMBLuse BioSeqIO
my $out = BioSeqIO-gtnewFh ( -fh =gt STDOUT-format =gt rsquofastarsquo)
my $Index_File_Name = shiftmy $inx = BioIndexEMBL-gtnew($Index_File_Name)
foreach my $id (ARGV) my $seq = $inx-gtfetch($id)if (defined $seq) print $out $seq else print STDERR no entry with id $id in this banquen
You can use this script like this
perl solutionuse_indexpl mydbidx GENSCAN_predicted_CDS_3
since entries names follows the pattern GENSCAN_predicted_CDS_xxx
Solution A28 Parse a Genscan report and build a database entry with the genomicsequence
Exercise 52
localbinperl -w
Genscan build a database - 1 entry per gene with genomic DNA + 1 feature for the whole CDS + 1 feature per exon
use strict
use BioToolsGenscanuse BioSeqIO
81
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 89: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/89.jpg)
Appendix A Solutions
my $genscan_file = shift ARGV
my $seqin = BioSeqIO-gtnew ( -fh =gt STDIN-format =gt rsquofastarsquo)
sequence submitted to genscanmy $seq = $seqin-gtnext_seq()
genscan reportmy $genscan = BioToolsGenscan-gtnew(-file =gt $genscan_file)
database to constructmy $banque = BioSeqIO-gtnewFh ( -fh =gt STDOUT
-format =gt rsquoemblrsquo)
+- delta of the subsequence for each entry in databasemy $delta = 100 date of constructionmy $date = gmtime
print $banque $seq
while(my $gene = $genscan-gtnext_prediction())
create entry object as database sequencemy $entry = BioSeqRichSeq-gtnew()$entry-gtadd_date($date)$entry-gtmolecule(rsquoDNArsquo)$entry-gtdivision(rsquoPLNrsquo)
extract id of genscan predictionmy $cds = $gene-gtpredicted_cdsmy ids = split(| $cds-gtid())
annotationmy $start = 0my $end = 0my $locmy $first = 1my $newloc = new BioLocationSplit()my $strand = 1my $nocds = 0foreach my $exon ($gene-gtexons())
$loc = $exon-gtlocation()
if ($first) $first=0 $start=$loc-gtstart-$delta if ($start lt 0) $start = 0
$loc-gtstart($loc-gtstart-$start)$loc-gtend ($loc-gtend-$start)
82
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 90: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/90.jpg)
Appendix A Solutions
$exon-gtadd_tag_value(rsquosourcersquo rsquogenscanrsquo)$exon-gtprimary_tag =~ m()Exon$exon-gtprimary_tag(rsquoExonrsquo)$exon-gtadd_tag_value(lc($1) 1)
add all detected exons as SeqFeature to the entry$entry-gtadd_SeqFeature($exon)
construct location for CDSif ($strand == -1 ampamp $loc-gtstrand == 1)
$newloc-gtwarn(exons are coded on different strands)$nocds = 1
if ($loc-gtstrand == -1)
$loc-gtstrand(1)$newloc-gtstrand(-1)
$newloc-gtadd_sub_Location($loc)
$end = $loc-gtend+$deltaif ($end gt $seq-gtlength()) $end = $seq-gtlength()
if ( $nocds) contruct a SeqFeature CDS for the entry ( join all exons)my $newcds = BioSeqFeatureGeneric-gtnew ( -primary =gt rsquoCDSrsquo
-location =gt $newloc)
add the translation to the CDS Feature$newcds-gtadd_tag_value (rsquotranslationrsquo $gene-gtpredicted_protein()-gtseq())
add the CDS as Feature to the entry$entry-gtadd_SeqFeature($newcds)
construct the sequence for the entry subsequence of the sequence submitted to genscan +- delta at each sidemy $seqstrif ( $start gt $end ) $seqstr = $seq-gtsubseq($end $start) else $seqstr = $seq-gtsubseq($start $end)
my $newseq = BioPrimarySeq-gtnew ( -seq =gt $seqstr-id =gt $ids[1] -moltype =gt rsquodnarsquo)
$entry-gtprimary_seq($newseq)
print the entry to the database flatfile in embl formatprint $banque $entry
83
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 91: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/91.jpg)
Appendix A Solutions
Solution A29 Build a small bioperl module (for the golden program)
Exercise 53
Build a small bioperl module
package LocalDBAccess
use strictuse vars qw(ISA AVAIL_LOCALDB)
use BioDBRandomAccessIuse BioSeqIOuse BioSeq
BEGIN
AVAIL_LOCALDB = ( sprot =gt rsquoswissrsquosptrnrdb =gt rsquoswissrsquogb =gt rsquogenbankrsquoembl =gt rsquoemblrsquo )
ISA = qw(BioDBRandomAccessI)
sub get_Seq_by_idmy ($self args) = _
return $self-gtget_Seq( args rsquo-irsquo)
sub get_Seq_by_accmy ($self args) = _
return $self-gtget_Seq(args rsquo-arsquo)
sub get_Seq
my ($self $entry $access) = _
84
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 92: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/92.jpg)
Appendix A Solutions
my ($db $id) = split(rsquorsquo $entry)
if ($access ~ ^-[ia]$) $access =
if ($id) $self-gtthrow (no database specified)
if (_has_local_DB($db))
$self-gtthrow ($db -- no such database available)
my $in = BioSeqIO-gtnew( -file =gt golden $access $entry 2gt devnull |-format =gt _get_seq_format($db))
my $seq = $in-gtnext_seq()
$in-gtDESTROY() if ($ gtgt 8) $self-gtthrow ($id -- no such entry in database $db)
$self-gtthrow ($id -- no such entry in database $db) if ( defined $seq)
return $seq
sub _has_local_DB
my ($db) = _
if (exists $AVAIL_LOCALDB$db) return 1 return 0
sub _get_seq_format
my ($db) = _
return $AVAIL_LOCALDB$db
85
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-
![Page 93: Bio Perl](https://reader033.fdocuments.in/reader033/viewer/2022051622/55cf9aba550346d033a31a71/html5/thumbnails/93.jpg)
Appendix A Solutions
86
- Bioperl course
-
- Chapter 1 General introduction
-
- 11 Bioperl documentation
- 12 General bioperl classes
-
- Chapter 2 Sequences
-
- 21 The BioSeqIO class
- 22 Sequence classes
- 23 Features and Location classes
- 24 Sequence analysis tools
-
- Chapter 3 Alignments
-
- 31 AlignIO
- 32 SimpleAlign
- 33 Code reading protal2dna
-
- Chapter 4 Analysis
-
- 41 Blast
- 42 Genscan
-
- Chapter 5 Databases
-
- 51 Database classes
- 52 Accessing a local database with golden
-
- Chapter 6 Perl Reminders
-
- 61 UML
- 62 Perl reminders to use bioperl modules
- 63 Perl reminders for a further advanced understanding of bioperl modules
-
- Appendix A Solutions
-
- A1 Sequences
- A2 Alignments
- A3 Analysis
- A4 Databases
-