Biopython programming workshop at UGA
Sequences and alignments; NCBI EUtils and BLAST; Phylogenetics; Protein structures
IOB Workshop: Biopython
A programming toolkit for bioinformatics
Eric Talevich
Institute of Bioinformatics, University of Georgia
Mar. 29, 2012
Eric Talevich IOB Workshop: Biopython
Getting started with Biopython
Installing Python
Biopython is a library for the Python programming language.
First, you’ll need these installed:
Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.
Now, start an interactive session in IDLE. [1]
[1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.
Installing Python packages
Biopython is a Python package. There are a few standard ways to install Python packages:
From source: Download from PyPI [2], unpack and install with the included setup.py script.
easy_install: Install setuptools [3] from source, then use the easy_install command to fetch and install all other packages by name:
    $ easy_install <package name>
pip: Like easy_install, use pip [4] to manage packages:
    $ pip install <package name>
[2] http://pypi.python.org/pypi/
[3] http://pypi.python.org/pypi/setuptools
[4] http://pypi.python.org/pypi/pip
Installing NumPy, matplotlib and Biopython
Biopython relies on a few other Python packages for extra functionality. We’ll use these:
numpy: efficient numerical functions and data structures (for Bio.PDB)
matplotlib: plotting (for Bio.Phylo)
Then finally:
biopython: the reason we’re here today
(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for
many Linux distributions.)
Testing
Check your Biopython installation:
>>> import Bio
>>> print Bio.__version__
Import a NumPy-based component:
>>> from Bio import PDB
Show a simple plot:
>>> from matplotlib import pyplot
>>> pyplot.plot(range(5), range(5))
>>> pyplot.show()
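The three checks above can also be run in one go. This is a minimal sketch, not part of Biopython: `check_packages` is a hypothetical helper that only tests whether each package imports. It is written to run under both Python 2.7 and Python 3.

```python
import importlib


def check_packages(names):
    """Return a dict mapping each package name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Report on the three packages used in this workshop.
report = check_packages(["Bio", "numpy", "matplotlib"])
for name in sorted(report):
    print("%-12s %s" % (name, "OK" if report[name] else "MISSING"))
```

A package reported as MISSING needs to be installed before the later examples will work.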
Let’s start using
Biopython
1 Sequences and alignments
    The Seq object
    SeqIO and the SeqRecord object
2 NCBI EUtils and BLAST
    EUtils: Entrez Programming Utilities
    NCBI Blast
    External programs
3 Phylogenetics
4 Protein structures
Sequences and Alignments
The Seq object
>>> from Bio.Seq import Seq
>>> myseq = Seq('AGTACACTGGT')
>>> myseq
Seq('AGTACACTGGT', Alphabet())
>>> print myseq
AGTACACTGGT
>>> myseq.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
>>> myseq.translate()
Seq('STL', ExtendedIUPACProtein())
A Seq object consists of:
data — the underlying Python character string
alphabet — DNA, RNA, protein, etc.
It supports most Python string methods:
>>> myseq.count('GT')
2
And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
Intrigued? Read on:
>>> help(Seq)
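To see what these methods compute, here is a plain-string sketch that reproduces the results shown above. It is an illustration only, not how Biopython implements Seq. The helper names are hypothetical, and it assumes unambiguous upper-case DNA.

```python
# Plain-string stand-ins for two Seq methods (unambiguous DNA only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Complement each base, reading the strand backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Coding-strand DNA to RNA: replace every T with U."""
    return dna.replace("T", "U")


myseq = "AGTACACTGGT"
print(reverse_complement(myseq))  # ACCAGTGTACT
print(transcribe(myseq))          # AGUACACUGGU
print(myseq.count("GT"))          # 2
```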
SeqIO: Sequence Input/Output
Sequence data is stored in many different file formats. Bio.SeqIO supports:
    abi      fastq    phylip     swiss
    ace      genbank  pir        tab
    clustal  ig       qual       uniprot-xml
    embl     imgt     seqxml
    emboss   nexus    sff
    fasta    phd      stockholm
Manually fetch some data from the PDB website: [5]
1ATP.fasta: two protein sequences, FASTA format
1ATP.pdb: the 3D structure, for later
[5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
The SeqIO API
SeqIO provides four functions:
parse: Iteratively parse all elements in the file
read: Parse a one-element file and return the element
write: Write elements to a file
convert: Parse one format and immediately write another
Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
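The difference between parse and read is easiest to see on a toy parser. The sketch below is a hypothetical pure-Python FASTA reader, not Bio.SeqIO itself. `parse_fasta` yields records one at a time. `read_fasta` insists on exactly one record and raises ValueError otherwise, which is also how Bio.SeqIO.read behaves.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples one at a time from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(handle):
    """Return the single record in the handle; raise ValueError otherwise."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly 1 record, found %d" % len(records))
    return records[0]


# Toy input with two records (made-up sequences).
toy = u">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
for header, seq in parse_fasta(io.StringIO(toy)):
    print("%s: %s" % (header, seq))
```

Because `parse_fasta` is a generator, it never holds more than one record in memory, which matters for large files.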
The SeqRecord object
SeqIO.parse returns an iterator over SeqRecords. SeqRecord wraps a Seq object and attaches metadata.
1 Pass the file name to the SeqIO parser; specify FASTA format:
    from Bio import SeqIO
    seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
    print seqrecs
2 To see all records at once, convert the iterator to a list:
    allrecs = list(seqrecs)
    print allrecs[0]
    print allrecs[0].seq
Example: Shuffled sequences
Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.
Procedure:
1 Read the source sequence from a file
    – Use Bio.SeqIO
2 In a loop:
    Shuffle the sequence
        – Use random.shuffle from Python’s standard library
    Create a new SeqRecord from the shuffled sequence
        – Because SeqIO.write works with SeqRecords
3 Write the shuffled SeqRecords to another file
    import random
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    # Read the single source record and remember its alphabet
    orig_rec = SeqIO.read("gi2.gb", "genbank")
    alphabet = orig_rec.seq.alphabet
    out_recs = []
    for i in xrange(1, 31):
        # random.shuffle works in place on a list, not on a Seq
        nucleotides = list(orig_rec.seq)
        random.shuffle(nucleotides)
        new_seq = Seq("".join(nucleotides), alphabet)
        new_rec = SeqRecord(new_seq,
                            id="shuffle" + str(i))
        out_recs.append(new_rec)

    SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
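A quick sanity check for the shuffling step is to confirm that every shuffled copy has the same letter counts as the source. The sketch below works on plain strings, so it runs without Biopython. `shuffled_copies` and its `seed` parameter are hypothetical, added here only to make runs repeatable.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n, seed=None):
    """Return n shuffled copies of a plain-string sequence."""
    rng = random.Random(seed)  # private generator; a seed makes runs repeatable
    copies = []
    for _ in range(n):
        letters = list(sequence)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


original = "AGTACACTGGT"
background = shuffled_copies(original, 30, seed=2012)
# Every shuffled copy must have exactly the same composition as the source.
assert all(Counter(s) == Counter(original) for s in background)
print(len(background))
```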
Example: ORF translation
Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
Biopython can help with each piece of this problem:
1 Parse the given unannotated DNA sequences (SeqIO.parse)
2 Get the template strand’s sequence (Seq.reverse_complement)
3 Translate both strands into protein sequences (Seq.translate)
4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
5 Split sequences at stop codons (Seq.split('*'))
6 Write translated sequences to a new file (SeqIO.write)
    def translate_six_frames(seq, table=1):
        """Translate a nucleotide sequence in 6 frames.

        Returns an iterable of 6 translated protein
        sequences.
        """
        rev = seq.reverse_complement()
        for i in range(3):
            # Coding (Crick) strand
            yield seq[i:].translate(table)
            # Template (Watson) strand
            yield rev[i:].translate(table)

    def translate_orfs(sequences, min_prot_len=60):
        """Find and translate all ORFs in sequences.

        Translates each sequence in all 6 reading frames,
        splits sequences on stop codons, and produces an
        iterable of all protein sequences of length at
        least min_prot_len.
        """
        for seq in sequences:
            for frame in translate_six_frames(seq):
                for prot in frame.split("*"):
                    if len(prot) >= min_prot_len:
                        yield prot

    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord

    if __name__ == "__main__":
        import sys
        infile = sys.stdin
        outfile = sys.stdout
        records = SeqIO.parse(infile, "fasta")
        seqs = (rec.seq for rec in records)
        proteins = translate_orfs(seqs)
        # Wrap each protein in a SeqRecord so SeqIO.write accepts it
        seqrecs = (SeqRecord(seq, id="orf" + str(i))
                   for i, seq in enumerate(proteins))
        SeqIO.write(seqrecs, outfile, "fasta")
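To experiment with the six-frame logic without Biopython, the same steps can be sketched on plain strings. Everything here is a stand-in: the codon table is the standard genetic code (NCBI table 1) and the function names are hypothetical. The frames come out grouped by strand rather than interleaved as in the code above. The small default `min_prot_len` is only there to suit toy inputs.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield 6 translations: offsets 0, 1, 2 on each strand."""
    rev = "".join(COMPLEMENT[b] for b in reversed(dna))
    for strand in (dna, rev):
        for offset in range(3):
            yield translate(strand[offset:])


def orfs(dna, min_prot_len=2):
    """Yield stop-free protein segments of at least min_prot_len residues."""
    for frame in six_frames(dna):
        for prot in frame.split("*"):
            if len(prot) >= min_prot_len:
                yield prot


print(translate("AGTACACTGGT"))  # STL, as in the Seq example earlier
```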
AlignIO and the Alignment object
Alignment: a set of sequences with the same length and alphabet.
Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3 CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1 9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4 9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6 9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0 9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263 9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8 CVHSA/647-1255
...
DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5 CVPPU/797-1449
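The "same length" requirement is what makes column-wise access possible. As a minimal sketch on plain strings, the snippet below uses the first 12 columns of three rows from the alignment above. The helper names are hypothetical and this is not the AlignIO API.

```python
def check_alignment(rows):
    """Raise ValueError unless every row has the same length."""
    lengths = set(len(r) for r in rows)
    if len(lengths) > 1:
        raise ValueError("rows differ in length: %s" % sorted(lengths))


def columns(rows):
    """Yield each alignment column as a string, left to right."""
    check_alignment(rows)
    for column in zip(*rows):
        yield "".join(column)


# First 12 columns of three rows from the alignment printed above.
rows = ["NCTDAV-----L", "NCTTAV-----M", "ECDIPIGAGICA"]
cols = list(columns(rows))
print(cols[0])    # NNE
print(len(cols))  # 12
```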
Snack Time
EUtils and BLAST
EUtils: Entrez Programming Utilities
Access NCBI’s online services:
    from Bio import Entrez, SeqIO
    Entrez.email = "[email protected]"
Request a GenBank record:
    handle = Entrez.efetch(db="protein", id="69316",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "gb")
Specify multiple IDs in one query:
    handle = Entrez.efetch(db="protein",
                           id="349839,349840",
                           rettype="fasta", retmode="text")
    records = SeqIO.parse(handle, "fasta")
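Under the hood, an EFetch call is an HTTP request to NCBI's E-utilities server. The sketch below only builds the request URL for the two-ID query above and sends nothing. `efetch_url` is a hypothetical helper and the email address is a placeholder to replace with your own.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, ids, rettype, retmode, email):
    """Build (but do not send) an EFetch request URL."""
    params = [
        ("db", db),
        ("id", ",".join(ids)),  # several IDs go in one comma-separated value
        ("rettype", rettype),
        ("retmode", retmode),
        ("email", email),
    ]
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder email address; NCBI asks for a real contact address.
url = efetch_url("protein", ["349839", "349840"], "fasta", "text",
                 "you@example.org")
print(url)
```

Bio.Entrez takes care of this URL construction, and of NCBI's request-rate guidelines, for you.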
Interlude: SeqRecord attributes
seq: the sequence (Seq) itself
id: primary ID for the sequence, e.g. accession number (string)
name: “common” name/id for the sequence, like GenBank LOCUS id
description: human-readable description of the sequence
letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
annotations: dictionary of additional unstructured info
features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
dbxrefs: list of database cross-references (strings)
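"Restricted" here means each per-letter annotation must supply one value per letter of the sequence, so that slicing a record can slice both together. The class below is a hypothetical, stripped-down stand-in that shows the idea. It is not SeqRecord and covers only a few of the attributes listed above.

```python
class MiniRecord(object):
    """Hypothetical, stripped-down stand-in for a SeqRecord."""

    def __init__(self, seq, id="<unknown id>", description="",
                 letter_annotations=None):
        self.seq = seq
        self.id = id
        self.description = description
        self.letter_annotations = {}
        for key, values in (letter_annotations or {}).items():
            # The "restricted" part: one value per letter of the sequence.
            if len(values) != len(seq):
                raise ValueError("%r needs %d values, got %d"
                                 % (key, len(seq), len(values)))
            self.letter_annotations[key] = values

    def __getitem__(self, index):
        """Slice the sequence and its per-letter annotations together."""
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        sliced = dict((k, v[index]) for k, v in self.letter_annotations.items())
        return MiniRecord(self.seq[index], self.id, self.description, sliced)


# Toy record with made-up quality scores.
rec = MiniRecord("ACGT", id="toy1",
                 letter_annotations={"quality": [30, 31, 12, 40]})
tail = rec[2:]
print("%s %s" % (tail.seq, tail.letter_annotations["quality"]))
```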
    from Bio import Entrez, SeqIO
    Entrez.email = "[email protected]"

    handle = Entrez.efetch(db="nucleotide", id="M95169",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print record
    print record.features[10]
    sliced = record[20000:]  # Last ~25% of the genome
    print sliced

    from Bio.Seq import Seq
    from Bio.Alphabet import generic_protein
    translations = [f.qualifiers["translation"]
                    for f in record.features[1:]]
    proteins = [Seq(t[0], generic_protein)
                for t in translations]
NCBI Blast
BLAST can be used either standalone or through NCBI’s server.
Online:
    >>> from Bio.Blast import NCBIWWW
    >>> result_handle = NCBIWWW.qblast(
    ...     'blastp', 'nr', query_string)
Standalone:
  “Legacy” (blastall):
    >>> from Bio.Blast.Applications import BlastallCommandline
    >>> help(BlastallCommandline)
  New hotness (Blast+):
    >>> from Bio.Blast.Applications import NcbiblastpCommandline
    >>> help(NcbiblastpCommandline)
Parsing BLAST output
BLAST produces reports in plain-text and XML format.
Biopython requests XML by default.
>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast('blastp',
...     'nr', query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record
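For a feel of what the XML report contains, here is a sketch that reads a tiny hand-written fragment with the standard library. The tag names follow BLAST's XML output, but the accessions, scores and sequences are all made up. `hsp_scores` is a hypothetical helper and not part of NCBIXML.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment using BLAST XML tag names; all values are made up.
TOY_XML = """
<Iteration_hits>
  <Hit>
    <Hit_accession>TOY_0001</Hit_accession>
    <Hit_hsps>
      <Hsp><Hsp_score>350</Hsp_score><Hsp_hseq>MKTAYIAK</Hsp_hseq></Hsp>
      <Hsp><Hsp_score>120</Hsp_score><Hsp_hseq>MKT</Hsp_hseq></Hsp>
    </Hit_hsps>
  </Hit>
  <Hit>
    <Hit_accession>TOY_0002</Hit_accession>
    <Hit_hsps>
      <Hsp><Hsp_score>90</Hsp_score><Hsp_hseq>AYIAK</Hsp_hseq></Hsp>
    </Hit_hsps>
  </Hit>
</Iteration_hits>
"""


def hsp_scores(xml_text):
    """Yield (accession, score) for every HSP of every hit."""
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        for hsp in hit.iter("Hsp"):
            yield accession, float(hsp.findtext("Hsp_score"))


for accession, score in hsp_scores(TOY_XML):
    print("%s %.0f" % (accession, score))
```

NCBIXML exposes the same nesting as `blast_record.alignments` (hits) and `alignment.hsps`.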
    # Search for homologs of a protein sequence

    from Bio import SeqIO
    from Bio.Blast import NCBIWWW, NCBIXML

    # Read and reformat the query sequence
    seqrec = SeqIO.read('gi2.gb', 'gb')
    query = seqrec.format('fasta')

    # Submit an online BLAST query
    # (This takes some time to run)
    result_handle = NCBIWWW.qblast('blastx', 'nr', query)

    # 1. Save the BLAST results as an XML file

    with open('aprotinin_blast.xml', 'w') as save_file:
        save_file.write(result_handle.read())
    result_handle.close()

    # NB: The BLAST result handle can only be read once
    # Reload it from the file
    with open('aprotinin_blast.xml') as result_handle:
        blast_record = NCBIXML.read(result_handle)

    # 2. Display a histogram of BLAST hit scores

    def get_scores(alignments):
        for aln in alignments:
            for hsp in aln.hsps:
                yield hsp.score

    scores = list(get_scores(blast_record.alignments))

    # Draw the histogram
    import pylab
    pylab.hist(scores, bins=20)
    pylab.title("Scores of %d BLAST hits" % len(scores))
    pylab.xlabel("BLAST score")
    pylab.ylabel("# hits")
    # Save a copy for later. Do this before show(): once the plot
    # window is closed, savefig may write an empty figure.
    pylab.savefig('aprotinin_scores.png')
    pylab.show()
Figure: Histogram of BLAST scores generated by pylab
    # 3. Extract the sequences of high-scoring BLAST hits

    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    def get_hsps(alignments, threshold):
        for aln in alignments:
            for hsp in aln.hsps:
                if hsp.score >= threshold:
                    yield SeqRecord(Seq(hsp.sbjct),
                                    id=aln.accession)
                    # Keep only the first qualifying HSP per hit
                    break

    best_seqs = get_hsps(blast_record.alignments, 321)
    SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
Calling other external programs
Biopython has wrappers for other command-line programs in:
Bio.Blast.Applications — the Blast+ suite
Bio.Align.Applications — Muscle, ClustalW, . . .
Bio.Emboss.Applications — needle, water, . . .
Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
    from Bio import AlignIO
    from Bio.Align.Applications import MuscleCommandline
    from StringIO import StringIO

    # Construct the shell command
    muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
    # Execute the command
    # Get output (the alignment) and any error messages
    muscle_out, muscle_err = muscle_cmd()

    # Read the alignment back in
    align = AlignIO.read(StringIO(muscle_out), "fasta")

    # Format the alignment for Phylip
    AlignIO.write([align], 'aprotinin.phy', 'phylip')
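The wrapper pattern used above (build a command, run it, get stdout and stderr back as strings) can be sketched with the standard library alone. `run_command` is a hypothetical helper. The "aligner" is a stand-in one-liner that prints two made-up FASTA records, so the example runs without Muscle installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command; return (stdout, stderr) as text. Raise on a nonzero exit."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("command failed (%d): %s" % (proc.returncode, err))
    return out, err


# Stand-in for an aligner: a tiny Python one-liner that prints FASTA to stdout.
fake_aligner = [sys.executable, "-c", "print('>a\\nAC-GT\\n>b\\nACGGT')"]
out, err = run_command(fake_aligner)
print(out)
```

Capturing stdout as a string is why the slide code wraps `muscle_out` in StringIO before handing it to AlignIO.read.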
Phylogenetics
Phylogenetic tree I/O
Start with:
>>> from Bio import Phylo
Input and output of trees is just like SeqIO:
read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
write: to any of the formats supported by read/parse
convert: between two formats in one step
Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")
What’s in a tree?
Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")
View the object structure of the entire tree:
>>> print tree
Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)
... OK, let’s do it properly now:
>>> Phylo.draw(tree)
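To see what a Newick string encodes, here is a small recursive parser for the tree above. It is a sketch for simple, well-formed input only (unquoted names, optional branch lengths, terminating semicolon) and is not how Bio.Phylo parses trees. `parse_newick` and `tip_depths` are hypothetical names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    pos = [0]  # mutable cursor shared with the nested helper

    def clade():
        children = []
        if text[pos[0]] == "(":
            pos[0] += 1  # skip "("
            children.append(clade())
            while text[pos[0]] == ",":
                pos[0] += 1
                children.append(clade())
            if text[pos[0]] != ")":
                raise ValueError("expected ')' at position %d" % pos[0])
            pos[0] += 1  # skip ")"
        start = pos[0]
        while text[pos[0]] not in ":,();":
            pos[0] += 1
        name = text[start:pos[0]] or None
        length = None
        if text[pos[0]] == ":":
            pos[0] += 1
            start = pos[0]
            while text[pos[0]] not in ",();":
                pos[0] += 1
            length = float(text[start:pos[0]])
        return (name, length, children)

    root = clade()
    if text[pos[0]] != ";":
        raise ValueError("expected ';' at position %d" % pos[0])
    return root


def tip_depths(node, depth=0.0):
    """Yield (tip name, root-to-tip distance) pairs."""
    name, length, children = node
    depth += length or 0.0
    if not children:
        yield name, depth
    for child in children:
        for pair in tip_depths(child, depth):
            yield pair


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
print(sorted(tip_depths(tree)))
```

Every tip of this tree lies 3.0 units from the root.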
Modify the tree
Check the tree object for its methods:
>>> help(tree)
Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)
External applications
Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:
Bio.Phylo.PAML: PAML wrappers & helpers
Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)
Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
Protein structures
Going 3D: The PDB module
Load a structure:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')
Inspect the object hierarchy:
>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model['E']
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)
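The nesting explored above comes from the fixed-column ATOM records of a PDB file. As a rough sketch, the code below builds three ATOM lines with made-up coordinates and parses them into nested dictionaries indexed by chain, residue number and atom name. `parse_atoms` and `ATOM_FMT` are hypothetical. The real Bio.PDB parser handles far more (models, hetero atoms, insertion codes, disorder).

```python
# Fixed-width layout of an ATOM record (columns 1-54 only).
ATOM_FMT = ("ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:4d}    "
            "{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(lines):
    """Build {chain_id: {residue_number: {atom_name: (x, y, z)}}} from ATOM lines."""
    chains = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        chain_id = line[21]               # column 22
        resseq = int(line[22:26])         # columns 23-26
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains.setdefault(chain_id, {}).setdefault(resseq, {})[atom_name] = xyz
    return chains


# Made-up coordinates, formatted into the fixed PDB columns.
toy_atoms = [(1, "N", "GLY", "E", 15, 1.0, 2.0, 3.0),
             (2, "CA", "GLY", "E", 15, 2.5, 2.0, 3.0),
             (3, "CA", "ALA", "E", 16, 4.0, 5.0, 6.0)]
lines = [ATOM_FMT.format(serial=s, name=" " + n, resname=r, chain=c,
                         resseq=q, x=x, y=y, z=z)
         for s, n, r, c, q, x, y, z in toy_atoms]

chains = parse_atoms(lines)
chain = chains["E"]    # compare: model['E']
residue = chain[15]    # compare: chain[15]
print(sorted(residue))
```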
Figure: The “SMCRA” object hierarchy
Extracting a peptide sequence
Get the amino acid sequence through a Polypeptide object:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()
Calculating RMSD
Given two aligned structures, filter a list of target residues for high RMS deviation.
Input: list of residue positions (integers)
       two equivalent chains from aligned protein models; residue numbers must match
       minimum RMSD value (float)
Output: list of residue positions, filtered
Procedure:
1 Extract coordinates of Cα atoms
2 If available (not glycine), extract Cβ coordinates, too
3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
4 Compare to the given RMSD threshold
    from Bio.SVDSuperimposer import SVDSuperimposer
    from numpy import array

    def filt_rms(resids, refchain, cmpchain, threshold=0.5):
        superimposer = SVDSuperimposer()
        for res in resids:
            refres = refchain[res]
            cmpres = cmpchain[res]
            coord1 = [refres['CA'].get_coord()]
            coord2 = [cmpres['CA'].get_coord()]
            if refres.has_id('CB') and cmpres.has_id('CB'):
                # Not glycine
                coord1.append(refres['CB'].get_coord())
                coord2.append(cmpres['CB'].get_coord())
            superimposer.set(array(coord1), array(coord2))
            # RMSD of the coordinates as given (no refitting)
            rmsd = superimposer.get_init_rms()
            if rmsd >= threshold:
                yield res
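The quantity being thresholded is the root-mean-square distance between paired atoms, taken in the coordinates as given. Assuming that reading of get_init_rms, a plain-Python sketch of the calculation looks like this. The `rmsd` helper is hypothetical and the coordinates are made up.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between paired (x, y, z) points, no refitting."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Two atoms (think C-alpha and C-beta) displaced by 1 and 2 units.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
other = [(0.0, 0.0, 1.0), (1.0, 2.0, 0.0)]
print(rmsd(ref, other))  # sqrt((1 + 4) / 2)
```

With only one or two atoms per residue, this measures local displacement between structures that are already superimposed, which is why the input must be aligned models.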
Figure: Superimposed structures, with selected deviating residues
Further reading
Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython wiki: http://biopython.org/
This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
Thanks
’Preciate it.
Gracias