BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple...
-
Upload
crystal-robbins -
Category
Documents
-
view
219 -
download
1
Transcript of BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple...
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 1
Topics
Advanced SQL for Multiple Table Databases JOINS
Objected Oriented Perl BioPerl
PDB BLAST
Some of the slides are based on previous material by J. Grefenstette
Multiple Table DatabasesSuppose we want to store info such as:+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Prabhu | Delhi | India | 21500000 | 1129866154 || Susan | New York | United States | 21900000 | 301139947 || Richard | Chicago | United States | 9800000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 |+----------+----------+----------------+-----------+------------+Problem: When population of city or country changes, we have to update many records
data may become inconsistent if we update, say, the city size for one person!
Solution: Design Database so that information is only stored in one place
called Normalizing the Database Uses several tables to eliminate many-to-one relationships within a table Queries can join several tables together
JOINS
2BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Multiple Table DatabasesThe PERSON table:
mysql> select * from person;+----+----------+----------+| id | name | city |+----+----------+----------+| 1 | Susan | New York || 2 | Robert | London || 3 | Ali | London || 4 | Richard | Chicago || 5 | Wu | Shanghai || 6 | Lakshmi | Bombay || 7 | Sohyoung | Seoul || 8 | Prabhu | Delhi |+----+----------+----------+8 rows in set (0.00 sec)
JOINS
3BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Multiple Table DatabasesThe CITY table:
mysql> select * from city limit 10;+-------------+---------------+------------+| name | country | population |+-------------+---------------+------------+| Tokyo | Japan | 33600000 || Seoul | South Korea | 23400000 || Mexico City | Mexico | 22400000 || New York | United States | 21900000 || Bombay | India | 21600000 || Delhi | India | 21500000 || Sao Paulo | Brazil | 20600000 || Los Angeles | United States | 18000000 || Shanghai | China | 17500000 || Osaka | Japan | 16700000 |+-------------+---------------+------------+10 rows in set (0.00 sec)
JOINS
4BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Multiple Table DatabasesThe COUNTRY table:
mysql> select * from country limit 10;+---------------+------------+| name | population |+---------------+------------+| China | 1321851888 || India | 1129866154 || United States | 301139947 || Indonesia | 234693997 || Brazil | 190010647 || Pakistan | 169270617 || Bangladesh | 150448339 || Russia | 141377752 || Nigeria | 135031164 || Japan | 127467972 |+---------------+------------+10 rows in set (0.01 sec)
JOINS
5BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 6
Wikipedia Definition of a JOIN
An SQL JOIN clause combines records from two or more tables in a database. It creates a set that can be saved as a table or used as is. A JOIN is a means for combining fields from two tables by using values common to each. ANSI standard SQL specifies four types of JOINs: INNER, OUTER, LEFT, and RIGHT. In special cases, a table (base table, view, or joined table) can JOIN to itself in a self-join.
JOINS
JOINing Two TablesTo JOIN two tables, use:SELECT columns From table1, table2 WHERE join conditions
mysql> select * from person, city where person.city = city.name;+----+----------+----------+----------+----------------+------------+| id | name | city | name | country | population |+----+----------+----------+----------+----------------+------------+| 1 | Susan | New York | New York | United States | 21900000 || 2 | Robert | London | London | United Kingdom | 12000000 || 3 | Ali | London | London | United Kingdom | 12000000 || 4 | Richard | Chicago | Chicago | United States | 9800000 || 5 | Wu | Shanghai | Shanghai | China | 17500000 || 6 | Lakshmi | Bombay | Bombay | India | 21600000 || 7 | Sohyoung | Seoul | Seoul | South Korea | 23400000 || 8 | Prabhu | Delhi | Delhi | India | 21500000 |+----+----------+----------+----------+----------------+------------+8 rows in set (0.01 sec)
JOINS
7BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Ambiguous Column NamesTo name a field, use table.column if column name appears
in multiple tables:
mysql> select person.name, person.city, country, populationfrom person, city where person.city = city.name;
+----------+----------+----------------+------------+| name | city | country | population |+----------+----------+----------------+------------+| Susan | New York | United States | 21900000 || Robert | London | United Kingdom | 12000000 || Ali | London | United Kingdom | 12000000 || Richard | Chicago | United States | 9800000 || Wu | Shanghai | China | 17500000 || Lakshmi | Bombay | India | 21600000 || Sohyoung | Seoul | South Korea | 23400000 || Prabhu | Delhi | India | 21500000 |+----------+----------+----------------+------------+8 rows in set (0.00 sec)
JOINS
8BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Aliases for Column TablesYou can provide an ALIAS for any column name:
mysql> select person.name, person.city, city.country,city.population as 'city size'from person, city where person.city = city.name;
+----------+----------+----------------+-----------+| name | city | country | city size |+----------+----------+----------------+-----------+| Susan | New York | United States | 21900000 || Robert | London | United Kingdom | 12000000 || Ali | London | United Kingdom | 12000000 || Richard | Chicago | United States | 9800000 || Wu | Shanghai | China | 17500000 || Lakshmi | Bombay | India | 21600000 || Sohyoung | Seoul | South Korea | 23400000 || Prabhu | Delhi | India | 21500000 |+----------+----------+----------------+-----------+8 rows in set (0.00 sec)
JOINS
9BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
WHERE conditions in JOINSUse WHERE conditions to limit which records are
returned.Example: show date for people from India
mysql> select person.name, person.city, city.country, city.population as 'city size' from person, city where person.city = city.name and city.country = 'India';
+---------+--------+---------+-----------+| name | city | country | city size |+---------+--------+---------+-----------+| Lakshmi | Bombay | India | 21600000 || Prabhu | Delhi | India | 21500000 |+---------+--------+---------+-----------+2 rows in set (0.00 sec)
JOINS
10BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
JOIN Three TablesCan JOIN three of more tables.Example: Show both country and city population for each person:
mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.name;
+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Susan | New York | United States | 21900000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Richard | Chicago | United States | 9800000 | 301139947 || Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 || Prabhu | Delhi | India | 21500000 | 1129866154 |+----------+----------+----------------+-----------+------------+8 rows in set (0.00 sec)
JOINS
11BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
ORDER BYSame as above, by sort results by country
population:
mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.nameorder by country.population desc;
+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Prabhu | Delhi | India | 21500000 | 1129866154 || Susan | New York | United States | 21900000 | 301139947 || Richard | Chicago | United States | 9800000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 |+----------+----------+----------------+-----------+------------+8 rows in set (0.00 sec)
JOINS
12BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 13
BioPerl
Object Oriented Programming (OOP) Intro to BioPerl
Bio::Seq Bio::SeqIO Bio::DB Bio::Perl Others
BioPerl OOP
Perl Modules A Module is a file containing Perl code
The file should have a .pm extension
Code within a module is brought into a program with the use statement:
use moduledir::module;
where moduledir is the directory in which the module file exists. the moduledir can be omitted if module is in current directory
or in the standard Perl directories (specified by the @INC list)
You can access data and subroutines by using “full name”use myModule;
$myModule::counter = 3;
myModule::translate($dna);
BioPerl OOP
14BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
Object-oriented programming (OOP) is a software engineering technique for modularizing code
Think about a program as a set of interacting objects
Objects have attributes data that describes the object
Objects also have actions (or methods) code that controls how an object interacts with other objects
Main ideas: Hide the details of an object’s data
so it can be changed easily (by the developer) users don’t need to know excess details
All interactions are performed via the object’s methods
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 15
Object Oriented Programming
BioPerl OOP
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 16
Basics of Perl Objects
In Procedural Programming, data and functions to operate on data that are kept separate. Example:$reverse_complement = revcom( $sequence );
In Object-Oriented Programming, both data and functions are encapsulated into the same construct (called objects). Example: $reversed_obj = $seq_obj->revcom();
$seq_obj is called an object, and revcom is called a method
In other words, the object $seq_obj “knows” how to calculate and return its own reverse complement.
BioPerl OOP
Basics of Perl Objects Objects are separated into types called classes
The class of an object defines both the data that it can hold and the methods that it knows.
A specific object that has a defined class is referred to as an instance of that class.
Perl uses the term "module" rather than "class".
An object in Perl is a special kind of hash reference.
In Bioperl the data that the object contains is stored in a single, complex hash and the object, like $seq_obj, is a reference to this hash. In addition, the methods that the object can use are also stored in this hash as particular kinds of references.
BioPerl OOP
17BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
18
BioPerl(www.bioperl.org) A collection of modules that facilitates the development of Perl
scripts for bioinformatics applications Some Simple Procedural Functions
Bio::Perl Sequences
Bio::Seq Bio:SeqIO Bio::PrimarySeq
Multiple Sequence Alignments Bio::SimpleAlign
Access to Databases Bio::DB::GenBank
Access to Tools Bio::Tools::Blast Bio::Tools::BPlite
See Table 12.1 of Tisdall
BioPerl
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 19
# these methods return strings, and accept strings in some cases:
$seqobj->seq(); # string of sequence$seqobj->subseq(5,10); # part of the sequence as a string$seqobj->accession_number(); # when there, the accession number$seqobj->alphabet(); # one of 'dna','rna',or 'protein'$seqobj->seq_version() # when there, the version$seqobj->keywords(); # when there, the Keywords line$seqobj->length() # length$seqobj->desc(); # description$seqobj->primary_id(); # a unique id for this sequence regardless # of its display_id or accession number$seqobj->display_id(); # the human readable id of the sequence
Most of these features are included in GenBank records.
Methods on Bio::Seq Objects
BioPerl
Using Bio::Seq Each Class contains a method called new that returns an new object of the
class The new function takes arguments that initialize the data belonging to the
new object Once an object is created, you can access all the methods associated with its
class by calling its attached functions
use Bio::Seq; # module to handle sequence data
# create a new sequence object$seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT', -id => 'human_id', -accession_number => 'AL000012', -alphabet => 'dna');
# the seq() method returns the object's sequence as a stringprint "DNA: ", $seq_obj->seq(), "\n";
# create a new sequence object contain protein sequence$protein_obj = $seq_obj->translate();print "Protein: ", $protein_obj->seq(), "\n";
OUTPUT:DNA: ATGGGGGTGGTGGTACCCTProtein: MGVVVP
BioPerl
20BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
Using Bio::Seq and Bio::SeqIO#!/usr/bin/perl
# File: bioseqio.pl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO; # module to read/write many sequence formats
# create a new sequence object
my $seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT',
-id => 'human_id',
-accession_number => 'AL000012',
-alphabet => 'dna');
# create a new FASTA file handler object (NOTE: ">" for writing)
my $seqio_obj = Bio::SeqIO->new(-file=>">myseq.fsa",'-format'=>'fasta');
# tell the file handler to print the sequence object
$seqio_obj->write_seq($seq_obj);
% cat myseq.fsa>human_idATGGGGGTGGTGGTACCCT
BioPerl
21BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
Reading FASTA file with Bio::SeqIO#!/usr/bin/perl# File: read_fasta.pluse strict;use warnings;use Bio::Seq;use Bio::SeqIO; # module to read/write many sequence formats
my $file = "seq3.fsa"; #this is a FASTA file with 3 seqs
# create a new FASTA file handler object (no ">" for reading!)my $seqio_obj = Bio::SeqIO->new(-file=> $file, -format=>"fasta");
# read sequences from the FASTA filewhile (my $seq_obj = $seqio_obj->next_seq()) { my $seq_string = $seq_obj->seq(); print "$seq_string\n";}OUTPUT:
ATTGGGTGCGCGTGCNCCTTCC
aaaaaatcatactacatgtagggtaca
aaaaaatcatact
BioPerl
22BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
A Sample GenBank Record - ILOCUS NM_007553 1541 bp mRNA linear ROD 07-
JAN-2002DEFINITION Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA.ACCESSION NM_007553VERSION NM_007553.1 GI:6680793KEYWORDS .SOURCE house mouse. ORGANISM Mus musculus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae; Mus.REFERENCE 1 (bases 1 to 1541) AUTHORS Feng,J.Q., Harris,M.A., Ghosh-Choudhury,N., Feng,M.,
Mundy,G.R. and Harris,S.E. TITLE Structure and sequence of mouse bone morphogenetic protein-2
gene (BMP-2): comparison of the structures and promoter regions
of BMP-2 and BMP-4 genes JOURNAL Biochim. Biophys. Acta 1218 (2), 221-224 (1994) MEDLINE 94289485REFERENCE 2 (bases 1 to 1541) AUTHORS Heller,L.C., Li,Y., Abrams,K.L. and Rogers,M.B. TITLE Transcriptional regulation of the Bmp2 gene. Retinoic acid induction in F9 embryonal carcinoma cells and Saccharomyces cerevisiae JOURNAL J. Biol. Chem. 274 (3), 1394-1400 (1999) MEDLINE 99098878 PUBMED 9880512COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to
final NCBI review. The reference sequence was derived from
L25602.1.
FEATURES Location/Qualifiers source 1..1541 /organism="Mus musculus" /db_xref="taxon:10090" /chromosome="2" /map="2 76.1 cM" /cell_type="spleen" gene 1..1541 /gene="Bmp2" /db_xref="LocusID:12156" /db_xref="MGD:88177" CDS 357..1541 /gene="Bmp2" /codon_start=1 /db_xref="LocusID:12156" /db_xref="MGD:88177" /product="bone morphogenetic protein 2" /protein_id="NP_031579.1" /db_xref="GI:6680794"
/translation="MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRP
SEDVLSEFELRLLSMFGLKQRPTPSKDVVVPPYMLDLYRRHSGQPGAPAPDHRLERAA
SRANTVRTFHQLEAVEELPEMSGKTARRFFFNLSSVPSDEFLTSAELQIFREQIQEAL
GNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPAVMRWTTQG
HTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHK
REKRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADH
LNSTNHAIVQTLVNSVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCG CR"
BioPerl
23BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 24
A Sample GenBank Record - II
misc_feature 492..1154 /note="TGFb_propeptide; Region: TGF-beta
propeptide" misc_feature 1236..1538 /note="TGFB; Region: Transforming growth
factor-beta (TGF-beta) family" misc_feature 1236..1538 /note="TGF-beta; Region: Transforming
growth factor beta like domain"BASE COUNT 361 a 444 c 414 g 322 tORIGIN 1 cgcactcctc cccctgctcg aggctgtgtg tcagcacttg
gctggagact tcttgaactt 61 gccgggagag tgacttgggc tccccacttc gcgccggtgt
cctcgcccgg cggatccagt 121 cttgccgcct ccagcccgat cacctctctt cctcagcccg
ctggcccacc ccaagacaca 181 gttccctaca gggagaacac ccggagaagg aggaggaggc
gaagaaaagc aacagaagcc 241 cagttgctgc tccaggtccc tcggacagag ctttttccat
gtggagactc tctcaatgga 301 cgtgccccct agtgcttctt agacggactg cggtctccta
aagccgcagg tcgaccatgg 361 tggccgggac ccgctgtctt ctagtgttgc tgcttcccca
ggtcctcctg ggcggcgcgg 421 ccggcctcat tccagagctg ggccgcaaga agttcgccgc
ggcatccagc cgacccttgt 481 cccggccttc ggaagacgtc ctcagcgaat ttgagttgag
gctgctcagc atgtttggcc 541 tgaagcagag acccaccccc agcaaggacg tcgtggtgcc
cccctatatg ctagatctgt 601 accgcaggca ctcaggccag ccaggagcgc ccgccccaga
ccaccggctg gagagggcag
661 ccagccgcgc caacaccgtg cgcacgttcc atcaactaga agccgtggag gaacttccag
721 agatgagtgg gaaaacggcc cggcgcttct tcttcaattt aagttctgtc cccagtgacg
781 agtttctcac atctgcagaa ctccagatct tccgggaaca gatacaggaa gctttgggaa
841 acagtagttt ccagcaccga attaatattt atgaaattat aaagcctgca gcagccaact
901 tgaaatttcc tgtgaccaga ctattggaca ccaggttagt gaatcagaac acaagtcagt
961 gggagagctt cgacgtcacc ccagctgtga tgcggtggac cacacaggga cacaccaacc
1021 atgggtttgt ggtggaagtg gcccatttag aggagaaccc aggtgtctcc aagagacatg
1081 tgaggattag caggtctttg caccaagatg aacacagctg gtcacagata aggccattgc
1141 tagtgacttt tggacatgat ggaaaaggac atccgctcca caaacgagaa aagcgtcaag
1201 ccaaacacaa acagcggaag cgcctcaagt ccagctgcaa gagacaccct ttgtatgtgg
1261 acttcagtga tgtggggtgg aatgactgga tcgtggcacc tccgggctat catgcctttt
1321 actgccatgg ggagtgtcct tttccccttg ctgaccacct gaactccact aaccatgcca
1381 tagtgcagac tctggtgaac tctgtgaatt ccaaaatccc taaggcatgc tgtgtcccca
1441 cagagctcag cgcaatctcc atgttgtacc tagatgaaaa tgaaaaggtt gtgctaaaaa
1501 attatcagga catggttgtg gagggctgcg ggtgtcgtta g
//
BioPerl
#!/usr/bin/perl#File: readgb.pl
use strict;use warnings;use Bio::Seq;use Bio::SeqIO;my $inputfile = $ARGV[0];
# create file input objectmy $in = Bio::SeqIO->new(-file=> $inputfile, -format=> 'GenBank'); while (my $seq = $in->next_seq()){ # print out parts of the sequence print "Sequence ", $seq->id(), " with accession ", $seq->accession_number, " and description:\n", $seq->desc, "\n\n";}
Reading GenBank files with Bio::SeqIO
Bio::SeqIO provides methods for reading and printing files in many file formats: FASTA, GenBank, SwissProt, GCG, EMBL, etc
BioPerl
25BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 26
[binf:~/binf634/bioperl] jsolka% ./readgb.pl genbank_sample.txtSequence NM_007553 with accession NM_007553 and description:Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA.
The Program in Action
BioPerl
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 27
Direct Access to Databases
You can create new Bio::Seq objects by importing them directly from a remote database using the Bio::DB family of modules GenBank: Bio::DB::GenBank GenPept Bio::DB::GenPept SwissProt Bio::DB::SwissProt GDB Bio::DB::GDB ACEDB Bio::DB:Ace
BioPerl
Fetching a Sequence from GenBank#!/usr/bin/perluse strict;use warnings;use Bio::Seq;use Bio::DB::GenBank;# File: getgb.pl
# get GI number from command linemy ($id) = @ARGV;
# create a new GenBank handle objectmy $gb = new Bio::DB::GenBank();
# Fetch the sequence from GenBank, returning a seq_objectmy $seq_obj = $gb->get_Seq_by_id($id);
# print out parts of the sequenceprint "Id = ", $id, "\n", "Accession = ", $seq_obj->accession_number, "\n", "Description = ", $seq_obj->desc, "\n", "Sequence = ", $seq_obj->seq, "\n";
BioPerl
28BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 29
[binf:~/binf634/bioperl] jsolka% getgb.pl 1293614Id = 1293614Accession = AAA98665Description = TCP1-beta [Saccharomyces cerevisiae].Sequence = SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM
•Check out this site for more information•http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
The Program in Action
BioPerl
Bio::Perl has a number of easy to use functions, including
get_sequence - gets sequence from standard internet databases read_sequence - reads a sequence from a file read_all_sequences - reads all sequences from a file new_sequence - makes a bioperl sequence from a string write_sequence - writes one or more sequences to a file translate - provides a translation of a sequence translate_as_string - provides translation of a sequence as string blast_sequence - BLASTs sequence against databases at NCBI write_blast - writes a blast report out to a file
Bio::Perl
BioPerl
30BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
The Bio::Perl module provides some simple access functions, for example, this script will retrieve a GenBank sequence and write it out in two different formats
#!/usr/bin/perl# File: getgenbank.pluse strict;use warnings;use Bio::Perl;
# this script will only work with an internet connection# on the computer it is run onmy $seq_object = get_sequence('genbank',"ROA1_HUMAN");
# write the sequence in FASTA formatwrite_sequence(">roa1.fasta",'fasta',$seq_object);
# write the sequence in GenBank formatwrite_sequence(">roa1.genbank",'genbank',$seq_object);
exit;
Bio::Perl
BioPerl
31BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
% cat roa1.fasta>ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing proteinSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF
% cat roa1.genbankLOCUS ROA1_HUMAN 371 aa linear HUMAN 01-MAR-1989DEFINITION Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein) (Single-strand binding protein) (hnRNP core protein A1).ACCESSION P09651KEYWORDS Nuclear protein; RNA-binding; Repeat; Ribonucleoprotein; Methylation; Phosphorylation; Alternative splicing; 3D-structure; Polymorphism.SOURCE Human ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.ORIGIN 1 sksespkepe qlrklfiggl sfettdeslr shfeqwgtlt dcvvmrdpnt krsrgfgfvt 61 yatveevdaa mnarphkvdg rvvepkravs redsqrpgah ltvkkifvgg ikedteehhl 121 rdyfeqygki evieimtdrg sgkkrgfafv tfddhdsvdk iviqkyhtvn ghncevrkal 181 skqemasass sqrgrsgsgn fgggrgggfg gndnfgrggn fsgrggfggs rggggyggsg 241 dgyngfgndg gyggggpgys ggsrgygsgg qgygnqgsgy ggsgsydsyn ngggrgfggg 301 sgsnfggggs yndfgnynnq ssnfgpmkgg nfggrssgpy ggggqyfakp rnqggyggss 361 ssssygsgrr f// 32
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 33
LWP: fetch WWW documents To automate WWW access LWP::Simple - procedural interface to LWP Example of usage:use LWP::Simple;
$url = "http://www.whatever.com/data.html";
$page = get($url);
if ($page)
{ # do something }
else { print "Problems getting $url"; } See the following URL for an example
http://stein.cshl.org/genome_informatics/chervitz/perl-modules/ http://oreilly.com/openbook/webclient/ch05.html
BioPerl
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 34
Protein Data Bank The Protein Data Bank (PDB) is
a repository of protein data structure
http://www.rcsb.org/pdb/home/home.do
The nature of proteins Primary sequences Secondary structures
Alpha helices Beta strands Turns
Tertiary structures Beta sheets Alpha-alpha units
Quaternary structures Enzyme Virus
PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 35
Overview of PDB
ASCII flat files that are human readable
One structure per file
Problems with consistency of the formats Code you create today may not work
tomorrow
BioPerlPDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 36
Files and Folders
PDB is distributed as files within directories
There are in-depth discussions of how to handle these directory structures within Chapter 11 of Tisdall We will talk a little about recursion next
week
PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 37
An Example PDB File - IHEADER IMMUNOGLOBULIN 10-MAR-99 43C9 TITLE CRYSTALLOGRAPHIC STRUCTURE OF THE ESTEROLYTIC AND TITLE 2 AMIDOLYTIC 43C9 ANTIBODY COMPND MOL_ID: 1; COMPND 2 MOLECULE: IMMUNOGLOBULIN (LIGHT CHAIN); COMPND 3 CHAIN: A, C, E, G; COMPND 4 FRAGMENT: FV; COMPND 5 ENGINEERED: YES; COMPND 6 MOL_ID: 2; COMPND 7 MOLECULE: IMMUNOGLOBULIN (HEAVY CHAIN); COMPND 8 CHAIN: B, D, F, H; COMPND 9 FRAGMENT: FV; COMPND 10 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 3 ORGANISM_COMMON: MOUSE; SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 5 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3); SOURCE 6 MOL_ID: 2; SOURCE 7 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 8 ORGANISM_COMMON: MOUSE; SOURCE 9 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 10 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3) KEYWDS IMMUNOGLOBULIN EXPDTA X-RAY DIFFRACTION
AUTHOR M.M.THAYER,E.D.GETZOFF,V.A.ROBERTS
REVDAT 1 18-AUG-99 43C9 0
JRNL AUTH M.M.THAYER,E.H.OLENDER,A.S.ARVAI,C.K.KOIKE,
JRNL AUTH 2 I.L.CANESTRELLI,J.D.STEWART,S.J.BENKOVIC,
JRNL AUTH 3 E.D.GETZOFF,V.A.ROBERTS
JRNL TITL STRUCTURAL BASIS FOR AMIDE HYDROLYSIS CATALYZED BY
JRNL TITL 2 THE 43C9 ANTIBODY
JRNL REF J.MOL.BIOL. V. 291 329 1999
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH V.A.ROBERTS,J.STEWART,S.J.BENKOVIC,E.D.GETZOFF
REMARK 1 TITL CATALYTIC ANTIBODY MODEL AND MUTAGENESIS IMPLICATE
REMARK 1 TITL 2 ARGININE IN TRANSITION- STATE STABILIZATION
REMARK 1 REF J.MOL.BIOL. V. 235 1098 1994
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 0070
REMARK 1 REFERENCE 2 ...
PDB
An Example PDB File - IIREMARK 800 REMARK 800 HIS 91 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G) - REMARK 800 NUCLEOPHILE THAT FORMS A COVALENT BOND TO THE REMARK 800 SUBSTRATE REMARK 800 REMARK 800 ARG 96 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G)- REMARK 800 SIDE CHAIN STABILIZES NEGATIVE CHARGE FORMED IN REMARK 800 THE TRANSITION STATES OF THE REACTION REMARK 800 REMARK 900 REMARK 900 RELATED ENTRIES REMARK 900 RELATED ID: 43CA RELATED DB: PDB DBREF 43C9 A 1 107 GB 207840 M68968 1 113 DBREF 43C9 B 0 112 GB 207840 M68968 127 244 DBREF 43C9 C 1 107 GB 207840 M68968 1 113 DBREF 43C9 D 0 112 GB 207840 M68968 127 244 DBREF 43C9 E 1 107 GB 207840 M68968 1 113 DBREF 43C9 F 0 112 GB 207840 M68968 127 244 DBREF 43C9 G 1 107 GB 207840 M68968 1 113 DBREF 43C9 H 0 112 GB 207840 M68968 127 244SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE
SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS SEQRES 1 B 118 GLY GLN VAL GLN LEU VAL GLU SER GLY PRO GLY LEU VAL SEQRES 2 B 118 ALA PRO SER GLN SER LEU SER ILE THR CYS THR VAL SER SEQRES 3 B 118 GLY ILE SER LEU SER ARG TYR ASN VAL HIS TRP VAL ARG SEQRES 4 B 118 GLN SER PRO GLY LYS GLY LEU GLU TRP LEU GLY MET ILE SEQRES 5 B 118 TRP GLY GLY GLY SER ILE GLU TYR ASN PRO ALA LEU LYS SEQRES 6 B 118 SER ARG LEU SER ILE SER LYS ASP ASN SER LYS SER GLN SEQRES 7 B 118 ILE PHE LEU LYS MET ASN SER LEU GLN THR ASP ASP SER SEQRES 8 B 118 ALA MET TYR TYR CYS VAL SER TYR GLY TYR GLY GLY ASP SEQRES 9 B 118 ARG PHE SER TYR TRP GLY GLN GLY THR LEU VAL THR VAL SEQRES 10 B 118 SER ...FORMUL 9 HOH *162(H2 O1) SHEET 1 A 4 MET A 4 THR A 7 0 SHEET 2 A 4 VAL A 19 LYS A 24 -1 N LYS A 24 O THR A 5 SHEET 3 A 4 ASP A 70 ILE A 75 -1 N ILE A 75 O VAL A 19 SHEET 4 A 4 PHE A 62 SER A 67 -1 N SER A 67 O ASP A 70 SHEET 1 B 5 SER A 10 SER A 14 0 SHEET 2 B 5 THR A 102 LYS A 107 1 N LYS A 103 O LEU A 11 SHEET 3 B 5 ALA A 84 GLN A 90 -1 N TYR A 86 O THR A 102 SHEET 4 B 5 LEU A 33 GLN A 38 -1 N GLN A 38 O ASP A 85 SHEET 5 B 5 PRO A 44 VAL A 48 -1 N VAL A 48 O TRP A 35 SHEET 1 C 4 GLN B 3 SER B 7 0 SHEET 2 C 4 LEU B 18 SER B 25 -1 N SER B 25 O GLN B 3 SHEET 3 C 4 GLN B 77 MET B 82 -1 N MET B 82 O LEU B 18 ...
PDB
38BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,
and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 39
An Example PDB File-IIIATOM 1 N ASP A 1 7.422 50.761 33.806 1.00 64.60 N ATOM 2 CA ASP A 1 8.287 50.204 34.870 1.00 77.57 C ATOM 3 C ASP A 1 8.848 48.834 34.503 1.00 77.32 C ATOM 4 O ASP A 1 9.752 48.357 35.183 1.00 82.14 O ATOM 5 CB ASP A 1 7.507 50.107 36.205 1.00 86.78 C ATOM 6 CG ASP A 1 8.392 49.691 37.423 1.00 92.57 C ATOM 7 OD1 ASP A 1 9.630 49.891 37.411 1.00 93.28 O ATOM 8 OD2 ASP A 1 7.826 49.183 38.422 1.00 96.45 O ATOM 9 N VAL A 2 8.333 48.184 33.458 1.00 70.66 N ATOM 10 CA VAL A 2 8.864 46.859 33.118 1.00 63.77 C ATOM 11 C VAL A 2 10.397 47.012 32.964 1.00 58.68 C ATOM 12 O VAL A 2 10.879 47.598 32.010 1.00 62.61 O ATOM 13 CB VAL A 2 8.149 46.194 31.855 1.00 62.22 C ...CONECT 6928 6353
MASTER 381 0 0 13 80 0 4 24 7254 8 16
76 END
PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 40
PDB File Format - I
In-depth discussions are available here http://www.wwpdb.org/docs.html
PDB files are composed of 80 columns that begin with one of several record names and end with a newline
Blank columns are padded with spaces A record type is one or more lines with the
same record name Each record type has a different number of
fields with them
PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 41
PDB File Formats -II
Record Types DBREF
Reference to the entry in the sequence database SEQADV
Identification of conflicts between pdb and the named sequence database
SEQRES Primary sequence of backbone residues
MODRES Identification of modification to standard residues
PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 42
Our Example PDB File – The DBREF Section
DBREF 43C9 A 1 107 GB 207840 M68968 1 113
DBREF 43C9 B 0 112 GB 207840 M68968 127 244
DBREF 43C9 C 1 107 GB 207840 M68968 1 113
DBREF 43C9 D 0 112 GB 207840 M68968 127 244
DBREF 43C9 E 1 107 GB 207840 M68968 1 113
DBREF 43C9 F 0 112 GB 207840 M68968 127 244
DBREF 43C9 G 1 107 GB 207840 M68968 1 113
DBREF 43C9 H 0 112 GB 207840 M68968 127 244
This tells us there is a PDB file called pdb43cp.ent and that the residues in chain A are labeled from 1 to 107 in the original GenBank database and the ID number is 207840 and the name m68968 are assigned to that database and the residues are numbered 1 to 113 in PDB
PDB
Our Example PDB File – The SEQRES Section
SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS
Col Data Type Field Def
1-6 Record name
“SEQRES”
9-10 Integer serNum Serial # of the SEQRES rec for the current chain. Starts at 1 and increments by 1. Resets for a new chain.
12 Character Chain ID Chain identifier may be a single character including a blank.
14-17 Integer numRes Number of residues in the chain. Repeated in every record.
20-22 Residue name
resName Residue name
24-26 Residue name
resName Residue name
PDB
43
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 44
Using BioPerl to Parse a pdb File
#!/usr/local/bin/perl -w
# Parsing PDB reports with BioPerl's Bio::Structure:IO module
use strict;use Bio::Structure::IO;
my $inFile;
# Prompt the user for the file name if it's not an argument
# NOTE: PDB file must be in text (not html) formatif (! $ARGV[0]){ print "What is the PDB file to parse? ";
# Get input and remove the newline character at the end
chomp ($inFile = <STDIN>);}else{ $inFile = $ARGV[0];}
my $structio = Bio::Structure::IO->new(-file => $inFile);
my $struc = $structio->next_structure; for my $chain ($struc->get_chains) { my $chainid = $chain->id; # one-letter chaincode if present,
'default' otherwise for my $res ($struc-
>get_residues($chain)) { my $resid = $res->id; # format is 3-lettercode - dash -
residue number, e.g. PHE-20 my $atoms = $struc->get_atoms($res); # actually a list of atom objects,
used here to get a count print join "\t", $chainid,$resid,
$atoms,"\n"; }}
BioPerl and PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 45
The Code in Action ./pdb_parse_0.pl pdb43c9.ent >
pdb_parse_0.outA ASP-1 8A VAL-2 7A VAL-3 7A MET-4 8A THR-5 7A GLN-6 9A THR-7 7A PRO-8 7A SER-9 6A SER-10 6A LEU-11 8A ALA-12 5...
BioPerl and PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 46
The Code in Action (cont.)
H GLY-106 4H THR-107 7H LEU-108 8H VAL-109 7H THR-110 7H VAL-111 7H SER-112 7default HOH-1 1default HOH-2 1default HOH-3 1default HOH-4 1default HOH-5 1...default HOH-440 1default HOH-449 1default HOH-454 1default HOH-456 1default HOH-463 1default HOH-469 1default HOH-472 1
BioPerl and PDB
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 47
BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome)
The Basic Local Alignment Search Tool (BLAST)
Tests a query against a library of known sequences to find similarity
Collection of programs query to database pairs
nucleotide-nucleotide protein-nucleotide protein-protein nucleotid protein
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 48
BLAST Advantages of a Local Installation Versus Public Server
Using public server Databases used by BLAST such as GenBank
are always up to date
Using local server Proprietary sequences Sequences that are run time and time again
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 49
String Matching and Homology BLAST is basically a string matching program Biological string matching looks for similarity as an
indication of homology Similarity between the query and the sequences in the
database may be measured by the percent identity, or the number of bases in the query that exactly match a corresponding region of a sequence from the database.
It may be measured by the degree of conservation , which finds matches between equivalent (redundant) codons or between amino acid residues with similar properties that don’t alter the function of a protein
Two sequences are or are not homologous, there’s no degree of homology
Beginning PERL FOR Bioinformatics, Tisdall
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 50
Homology It is difficult to find a single consistent definition of homology
“In genetics, the term "homolog" is used both to refer to a homologous protein, and to the gene (DNA sequence) encoding it.”
-- Wikipedia
In-depth philosophical discussions can be found in
G. P. Wagner, “The Biological Homology Concept,” Annu. Rev. Ecol. Syst. 1989. 20:51~9.
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 51
String Matching References Algorithms on String, Trees, and Sequences:
Computer Science and Computational Biology, Dan Gusfield: Cambridge University Press, 1997.
Defacto Standard
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
These two complement each other nicely
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 52
Outputs of a BLAST Search - I
Standard outputs A set of scores and statistics on the matches based
on the raw score S, various parameters of the scoring algorithm, and properties of the query and database
The raw score S A measure of similarity and the size of the match
BLAST lists hits by their E value The E value (expect) value of a match estimates the
chance that the string match (allowing for gaps) occurs in a randomly generate database of the same size and composition.
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 53
Outputs of a BLAST Search - II
The closer the E value is to 0 the less likely that the match occurred by chance The better the match
For BLASTN E value less than 1 may be a solid hit E value less than 10 may be worth looking
at
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 54
Screen Capture of Example BLAST File Using First 8 Lines (400 Bases) of the File sample.dna From Chapter 8 of Tisdall as a Query Against the nt BLAST Database
BLAST
Some additional tutorial information can be found athttp://www.ncbi.nlm.nih.gov/Class/minicourses/
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 55
A Closer Look at This File
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 56
Output from Clicking on the First Line
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 57
A Little Further Down in the File
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 58
Another Example BLAST File (Text Version) - I
BLASTN 2.2.2 [Jan-08-2002]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|16972687|gb|BM083928.1|BM083928imageqc_2_2001/smj416bdff41.x1 Soares_NFL_T_GBC_S1 Homo
sapiens cDNAclone IMAGE:4597167 3', mRNA sequence (505 letters)
Database: Non-redundant nucleotide - Mon Feb 18 09:50:08 EST 2002
1,171,752 sequences; 566,364,212 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
ref|NM_017857.1| Homo sapiens slingshot 3 (SSH-3), mRNA 957 0.0
ref|XM_040057.1| Homo sapiens slingshot 3 (SSH-3), mRNA 957 0.0
dbj|AK000522.1|AK000522 Homo sapiens cDNA FLJ20515 fis, clone KA... 957 0.0
gb|BC004210.1|BC004210 Homo sapiens, Similar to hypothetical pro... 954 0.0
gb|BC004176.1|BC004176 Homo sapiens, Similar to hypothetical pro... 954 0.0
gb|AF085851.1|HUMYI53C10 Homo sapiens full length insert cDNA cl... 944 0.0
ref|NM_018276.1| Homo sapiens slingshot 3 (SSH-3), mRNA 615 e-174
dbj|AK001790.1|AK001790 Homo sapiens cDNA FLJ10928 fis, clone OV... 615 e-174
emb|AL353811.12|AL353811 Human DNA sequence from clone RP11-482I... 46 0.004
emb|AL035531.17|HS483L3 Human DNA sequence from clone RP3-483L3 ... 44 0.014
>ref|NM_017857.1| Homo sapiens slingshot 3 (SSH-3), mRNA Length = 2064
Score = 957 bits (483), Expect = 0.0 Identities = 492/494 (99%), Gaps = 1/494 (0%) Strand = Plus / Minus
Query: 4 aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaaggcta 63
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 2046 aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaaggcta 1987
BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 59
Another Example BLAST File (Text Version) - II
…Matrix: blastn matrix:1 -3Gap Penalties: Existence: 5, Extension: 2Number of Hits to DB: 1,627,832Number of Sequences: 1171752Number of extensions: 1627832Number of successful extensions: 453460Number of sequences better than 5.0e-02: 10length of query: 505length of database: 566,364,212effective HSP length: 19effective length of query: 486effective length of database: 544,100,924effective search space: 264433049064effective search space used: 264433049064T: 0A: 40X1: 6 (11.9 bits)X2: 15 (29.7 bits)S1: 12 (24.3 bits)S2: 22 (44.1 bits
BLAST
Parsing BLAST reports with SearchIO#!/usr/local/bin/perl -w
# Parsing BLAST reports with BioPerl's Bio::SearchIO module# WI Biocomputing course - Bioinformatics for Biologists -
October 2003
# See help at http://www.bioperl.org/wiki/HOWTO:SearchIO for all data that can be extracted
use Bio::SearchIO;
# Prompt the user for the file name if it's not an argument# NOTE: BLAST file must be in text (not html) formatif (! $ARGV[0]){ print "What is the BLAST file to parse? ";
# Get input and remove the newline character at the end chomp ($inFile = <STDIN>);}else{ $inFile = $ARGV[0];}
$report = new Bio::SearchIO( -file=>"$inFile", -format => "blast");
print "QueryAcc\tHitDesc\tHitSignif\tHSP_rank\t\%ID\teValue\tHSP_length\n";
# Go through BLAST reports one by one
while($result = $report->next_result) { # Go through each each matching sequence while($hit = $result->next_hit) { # Go through each each HSP for this
sequence while ($hsp = $hit->next_hsp) { # Print some tab-delimited data
about this HSP print $result->query_accession, "\
t"; print $hit->description, "\t"; print $hit->significance, "\t"; print $hsp->rank, "\t"; print $hsp->percent_identity, "\t"; print $hsp->evalue, "\t"; print $hsp->hsp_length, "\n"; } } }
BioPerl and BLAST
60
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 61
Output from blast_parse_0.plQueryAcc HitDesc HitSignif HSP_rank %ID eValue HSP_lengthBM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 0.0 1 99.5951417004049
0.0 494BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 0.0 1 99.5951417004049
0.0 494BM083928.1 Homo sapiens cDNA FLJ20515 fis, clone KAT09889 0.0 1 99.5951417004049
0.0 494BM083928.1 Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:4436 IMAGE:2958967, mRNA,
complete cds0.0 1 99.5934959349593 0.0 492BM083928.1 Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:2772 IMAGE:2958967, mRNA,
complete cds0.0 1 99.5934959349593 0.0 492BM083928.1 Homo sapiens full length insert cDNA clone YI53C10 0.0 1
99.5893223819302 0.0 487BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 1 99.685534591195
1e-174 318BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 2 100 6e-48
98BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 3 98.1132075471698
1e-18 53BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN
PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 1 99.685534591195 1e-174318
BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 2 100 6e-48 98
BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 3 98.1132075471698 1e-1853
BM083928.1 Human DNA sequence from clone RP11-482I10 on chromosome 9 Contains STSs and GSSs, complete sequence [Homo sapiens] 0.004 1 100 0.004 23
BM083928.1 Human DNA sequence from clone RP3-483L3 on chromosome 6p25.1-25.3 Contains STSs and GSSs, complete sequence [Homo sapiens] 0.014 1 100 0.014 22
BioPerl and BLAST
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 62
Summary BioPerl offer several modules that make processing sequence
data convenient
Can fetch sequence data and GenBank record directly from GenBank
More info Online documentation:% perldoc Bio::Perl% perldoc Bio::DB::GenBank% perldoc Bio::SeqIO http://www.bioperl.org/wiki http://www.bioperl.org/wiki/HOWTOs
BioPerl
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 63
Homework
Read Chapter 12 Exercise 12.4 Exercise 12.6