Biopython programming workshop at UGA


description

A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.

Transcript of Biopython programming workshop at UGA

Page 1: Biopython programming workshop at UGA

Sequences and alignments
NCBI EUtils and BLAST
Phylogenetics
Protein structures

IOB Workshop: Biopython
A programming toolkit for bioinformatics

Eric Talevich

Institute of Bioinformatics, University of Georgia

Mar. 29, 2012

Eric Talevich IOB Workshop: Biopython

Page 2: Biopython programming workshop at UGA


Getting started with Python

Page 3: Biopython programming workshop at UGA

Installing Python

Biopython is a library for the Python programming language.

First, you’ll need these installed:

Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)

IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.

Now, start an interactive session in IDLE. [1]

[1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.

Page 4: Biopython programming workshop at UGA

Installing Python packages

Biopython is a Python package. There are a few standard ways to install Python packages:

From source: Download from PyPI [2], unpack and install with the included setup.py script.

easy_install: Install setuptools from source [3], then use the easy_install command to fetch and install all other packages by name:
$ easy_install <package name>

pip: Like easy_install, use pip [4] to manage packages:
$ pip install <package name>

[2] http://pypi.python.org/pypi/
[3] http://pypi.python.org/pypi/setuptools
[4] http://pypi.python.org/pypi/pip

Page 5: Biopython programming workshop at UGA

Installing NumPy, matplotlib and Biopython

Biopython relies on a few other Python packages for extra functionality. We’ll use these:

numpy: efficient numerical functions and data structures (for Bio.PDB)

matplotlib: plotting (for Bio.Phylo)

Then finally:

biopython: the reason we’re here today

(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)

Page 6: Biopython programming workshop at UGA

Testing

Check your Biopython installation:

>>> import Bio
>>> print Bio.__version__

Import a NumPy-based component:

>>> from Bio import PDB

Show a simple plot:

>>> from matplotlib import pyplot

>>> pyplot.plot(range(5), range(5))

>>> pyplot.show()
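The same checks can be collected into one script. This is only a sketch, written for Python 3 and the standard library (the interactive session on this slide is Python 2); the helper name `check_packages` is hypothetical and not part of Biopython. It reports whether each package can be located, without importing it.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # "Bio" is the import name of the biopython distribution.
    status = check_packages(["Bio", "numpy", "matplotlib"])
    for name in sorted(status):
        print("%-12s %s" % (name, "OK" if status[name] else "MISSING"))
```

A package reported as MISSING can then be installed with one of the methods from the previous slide.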

Page 7: Biopython programming workshop at UGA

Let’s start using Biopython

Page 8: Biopython programming workshop at UGA

Biopython

1 Sequences and alignments
    The Seq object
    SeqIO and the SeqRecord object

2 NCBI EUtils and BLAST
    EUtils: Entrez Programming Utilities
    NCBI Blast
    External programs

3 Phylogenetics

4 Protein structures

Page 9: Biopython programming workshop at UGA

Sequences and Alignments

Page 10: Biopython programming workshop at UGA

The Seq object

>>> from Bio.Seq import Seq
>>> myseq = Seq('AGTACACTGGT')
>>> myseq
Seq('AGTACACTGGT', Alphabet())
>>> print myseq
AGTACACTGGT
>>> myseq.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
>>> myseq.translate()
Seq('STL', ExtendedIUPACProtein())

Page 11: Biopython programming workshop at UGA

A Seq object consists of:

data: the underlying Python character string

alphabet: DNA, RNA, protein, etc.

It supports most Python string methods:
>>> myseq.count('GT')
2

And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())

Intrigued? Read on:
>>> help(Seq)
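To see what the biology-specific methods compute, here is a plain-string sketch in Python 3 that needs no Biopython. The function names mirror the Seq methods but are stand-ins written for this example, and the complement table is assumed to cover only the four unambiguous DNA bases.

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Return the RNA form of a coding-strand DNA string (T becomes U)."""
    return dna.replace("T", "U")


myseq = "AGTACACTGGT"             # the example sequence from the slides
print(myseq.count("GT"))          # ordinary string method, as on the slide
print(reverse_complement(myseq))
print(transcribe(myseq))
```

The results agree with the Seq outputs shown on the slides; the real Seq class additionally handles ambiguity codes and (in the Biopython version used here) tracks the alphabet.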

Page 12: Biopython programming workshop at UGA

SeqIO: Sequence Input/Output

Sequence data is stored in many different file formats. Bio.SeqIO supports:

abi      fastq    phylip     swiss
ace      genbank  pir        tab
clustal  ig       qual       uniprot-xml
embl     imgt     seqxml
emboss   nexus    sff
fasta    phd      stockholm

Manually fetch some data from the PDB website: [5]

1ATP.fasta: two protein sequences, FASTA format

1ATP.pdb: the 3D structure, for later

[5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP

Page 13: Biopython programming workshop at UGA

The SeqIO API

SeqIO provides four functions:

parse: Iteratively parse all elements in the file

read: Parse a one-element file and return the element

write: Write elements to a file

convert: Parse one format and immediately write another

Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
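The difference between parse and read is easiest to see on a toy format reader. The sketch below is a simplified stand-in written in Python 3, not Bio.SeqIO itself: `parse_fasta` and `read_fasta` are hypothetical names, and they return plain (header, sequence) tuples instead of SeqRecords.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs lazily from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(handle):
    """Return the single record in the handle; fail unless there is exactly one."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly 1 record, found %d" % len(records))
    return records[0]


two_records = ">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n"
for header, seq in parse_fasta(io.StringIO(two_records)):
    print(header, len(seq))
```

As with the four-function convention described above, the iterating function works on files of any size, while the single-record function raises an error when the file holds zero or several records.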

Page 14: Biopython programming workshop at UGA

The SeqRecord object

SeqIO.parse returns SeqRecords. SeqRecord wraps a Seq object and attaches metadata.

1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs

2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq

Page 16: Biopython programming workshop at UGA

Example: Shuffled sequences

Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.

Procedure:

1 Read the source sequence from a file
    – Use Bio.SeqIO

2 In a loop:
    Shuffle the sequence
    – Use random.shuffle from Python’s standard library
    Create a new SeqRecord from the shuffled sequence
    – Because SeqIO.write works with SeqRecords

3 Write the shuffled SeqRecords to another file

Page 19: Biopython programming workshop at UGA

import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read the single source record and remember its alphabet
orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    # random.shuffle works in place on a list, not on a Seq
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
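The claim that shuffling keeps the composition can be checked directly. A minimal Python 3 sketch on plain strings, assuming a made-up input sequence; `shuffled_copies` is a hypothetical helper, and the `seed` parameter is added here only so that runs are repeatable.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n, seed=None):
    """Return n shuffled versions of sequence, each with the same letters."""
    rng = random.Random(seed)  # private generator, so seeding is repeatable
    copies = []
    for _ in range(n):
        letters = list(sequence)
        rng.shuffle(letters)   # shuffles the list in place
        copies.append("".join(letters))
    return copies


original = "ATGGCGTACGCTTGA"   # made-up example sequence
background = shuffled_copies(original, 30, seed=1)
# Every shuffled copy has exactly the same letter counts as the original.
assert all(Counter(s) == Counter(original) for s in background)
print(len(background))
```

Note that this preserves single-letter composition only; dinucleotide frequencies are generally not preserved by a plain shuffle.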

Page 20: Biopython programming workshop at UGA

Example: ORF translation

Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.

Biopython can help with each piece of this problem:

1 Parse the given unannotated DNA sequences (SeqIO.parse)

2 Get the template strand’s sequence (Seq.reverse_complement)

3 Translate both strands into protein sequences (Seq.translate)

4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)

5 Split sequences at stop codons (Seq.split('*'))

6 Write translated sequences to a new file (SeqIO.write)

Page 21: Biopython programming workshop at UGA

def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein
    sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)

Page 22: Biopython programming workshop at UGA

def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames,
    splits sequences on stop codons, and produces an
    iterable of all protein sequences of length at
    least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot

Page 23: Biopython programming workshop at UGA

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    # Read FASTA from standard input, write FASTA to standard output
    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    # Wrap each translated ORF in a SeqRecord so SeqIO.write accepts it
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
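The six-frame procedure can also be traced on plain strings. The following Python 3 sketch uses only the standard library and assumes the standard genetic code (NCBI translation table 1) and unambiguous DNA; `translate`, `six_frames` and `orfs` are stand-ins written for this example, not Biopython functions. As in the workshop code, an "ORF" here is simply a stop-free stretch, with no start codon required.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code (NCBI table 1), in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield 6 translations: 3 forward frames and 3 reverse-complement frames."""
    rev = "".join(COMPLEMENT[base] for base in reversed(dna))
    for offset in range(3):
        yield translate(dna[offset:])
        yield translate(rev[offset:])


def orfs(dna, min_len=2):
    """Yield stop-free peptide segments of at least min_len residues."""
    for frame in six_frames(dna):
        for peptide in frame.split("*"):
            if len(peptide) >= min_len:
                yield peptide


print(translate("AGTACACTGGT"))  # the example sequence from the Seq slides
print(list(orfs("ATGGCCTAA")))
```

Slicing with `dna[offset:]` is the same trick as step 4 of the procedure: dropping one or two leading letters shifts the reading frame.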

Page 24: Biopython programming workshop at UGA

AlignIO and the Alignment object

Alignment: a set of sequences with the same length and alphabet.

Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
...
DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
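Because every row of an alignment has the same length, it can be read column by column. A small Python 3 sketch on plain strings, using the first eight columns of three rows from the printout above; `columns` and `conserved_positions` are hypothetical helpers, not part of Bio.AlignIO.

```python
def columns(rows):
    """Return alignment columns as strings; all rows must have equal length."""
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        raise ValueError("rows differ in length: %s" % sorted(lengths))
    return ["".join(col) for col in zip(*rows)]


def conserved_positions(rows, gap="-"):
    """Return 0-based indices of columns where all rows share one non-gap letter."""
    return [i for i, col in enumerate(columns(rows))
            if len(set(col)) == 1 and col[0] != gap]


# First 8 columns of three rows from the alignment printed above.
rows = ["NCTDAV--", "NCTTAV--", "NCTEPV--"]
print(conserved_positions(rows))
```

The length check is the plain-string version of the "same length" requirement in the definition of an alignment.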

Page 25: Biopython programming workshop at UGA

Snack Time

Page 26: Biopython programming workshop at UGA

EUtils and BLAST

Page 27: Biopython programming workshop at UGA

EUtils: Entrez Programming Utilities

Access NCBI’s online services:
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"

Request a GenBank record:
handle = Entrez.efetch(db="protein", id="69316",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")

Specify multiple IDs in one query:
handle = Entrez.efetch(db="protein",
                       id="349839,349840",
                       rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
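Under the hood, an efetch request is an HTTP query against NCBI's E-utilities server. The Python 3 sketch below only builds such a URL and sends nothing; `build_efetch_url` is a hypothetical helper, and the way Bio.Entrez assembles its requests internally may differ from this.

```python
from urllib.parse import urlencode

# Base address of NCBI's efetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, ids, rettype, retmode="text", email=None):
    """Return an efetch URL for one or more record IDs (no request is sent)."""
    params = [("db", db), ("id", ",".join(ids)),
              ("rettype", rettype), ("retmode", retmode)]
    if email:
        params.append(("email", email))
    return EFETCH_URL + "?" + urlencode(params)


print(build_efetch_url("protein", ["349839", "349840"], "fasta"))
```

Joining the IDs with commas corresponds to the multiple-ID query on this slide; `urlencode` takes care of escaping the comma and any other special characters.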

Page 30: Biopython programming workshop at UGA

Interlude: SeqRecord attributes

seq: the sequence (Seq) itself

id: primary ID for the sequence, e.g. accession number (string)

name: “common” name/id for the sequence, like GenBank LOCUS id

description: human-readable description of the sequence

letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores

annotations: dictionary of additional unstructured info

features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence.

dbxrefs: list of database cross-references (strings)

Page 31: Biopython programming workshop at UGA

from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"

handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last ~25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]

Page 32: Biopython programming workshop at UGA

NCBI Blast

BLAST can be used either standalone or through NCBI’s server.

Online:
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast(
...     'blastp', 'nr', query_string)

Standalone:

“Legacy” (blastall):
>>> from Bio.Blast.Applications import BlastallCommandline
>>> help(BlastallCommandline)

New hotness (Blast+):
>>> from Bio.Blast.Applications import NcbiblastpCommandline
>>> help(NcbiblastpCommandline)

Page 33: Biopython programming workshop at UGA

Parsing BLAST output

BLAST produces reports in plain-text and XML format.

Biopython requests XML by default.

>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast('blastp',
...     'nr', query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record
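To get a feel for what the XML parser has to walk through, here is a Python 3 sketch that reads a tiny hand-written fragment with the standard library's ElementTree. The fragment is heavily trimmed and its accessions and scores are invented; real BLAST XML carries many more fields, and NCBIXML returns richer record objects than the plain dictionary built by the hypothetical `hit_scores` helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed fragment in the style of BLAST XML output;
# the accessions and scores are made up for illustration.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_accession>HYPO_0001</Hit_accession>
          <Hit_hsps>
            <Hsp><Hsp_score>512</Hsp_score></Hsp>
            <Hsp><Hsp_score>87</Hsp_score></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_accession>HYPO_0002</Hit_accession>
          <Hit_hsps>
            <Hsp><Hsp_score>301</Hsp_score></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hit_scores(xml_text):
    """Return {accession: [HSP scores]} for every hit in the XML text."""
    root = ET.fromstring(xml_text)
    scores = {}
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        scores[accession] = [float(hsp.findtext("Hsp_score"))
                             for hsp in hit.iter("Hsp")]
    return scores


print(hit_scores(SAMPLE_XML))
```

Each hit can hold several HSPs (high-scoring segment pairs), which is why later slides loop over alignments and then over their HSPs.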

Page 34: Biopython programming workshop at UGA

# Search for homologs of a protein sequence

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seqrec = SeqIO.read('gi2.gb', 'gb')
query = seqrec.format('fasta')

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)

Page 35: Biopython programming workshop at UGA

# 1. Save the BLAST results as an XML file

with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())
result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)

Page 36: Biopython programming workshop at UGA

# 2. Display a histogram of BLAST hit scores

def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")

# Save a copy for later (before show(), after which the figure may be gone)
pylab.savefig('aprotinin_scores.png')
pylab.show()

Page 37: Biopython programming workshop at UGA

Figure: Histogram of BLAST scores generated by pylab

Page 38: Biopython programming workshop at UGA

# 3. Extract the sequences of high-scoring BLAST hits

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                # Keep only the first qualifying HSP per hit
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')

Page 39: Biopython programming workshop at UGA

Calling other external programs

Biopython has wrappers for other command-line programs in:

Bio.Blast.Applications: the Blast+ suite

Bio.Align.Applications: Muscle, ClustalW, ...

Bio.Emboss.Applications: needle, water, ...

Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.

Page 40: Biopython programming workshop at UGA

from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")

# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')
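The pattern behind such a wrapper (build a command line, run it, capture standard output and standard error) can be sketched with the standard library alone. This Python 3 example does not call Muscle: a child Python process that prints one FASTA record stands in for the aligner so the snippet runs anywhere, and `run_command` is a hypothetical helper, not a Biopython function.

```python
import subprocess
import sys


def run_command(args):
    """Run an external command; return its (stdout, stderr) as text."""
    result = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout, result.stderr


# Stand-in for an aligner: a child Python process that prints one FASTA record.
stand_in = [sys.executable, "-c", "print('>seq1'); print('AC-GT')"]
out, err = run_command(stand_in)
print(out)
```

With `check=True` a non-zero exit status raises an exception instead of silently returning empty output; the captured text can then be wrapped in a StringIO and parsed, as done on the slide.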

Page 41: Biopython programming workshop at UGA

Phylogenetics

Page 42: Biopython programming workshop at UGA

Phylogenetic tree I/O

Start with:
>>> from Bio import Phylo

Input and output of trees is just like SeqIO:

read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats

write: to any of the formats supported by read/parse

convert: between two formats in one step

Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")

Page 43: Biopython programming workshop at UGA

What’s in a tree?

Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")

View the object structure of the entire tree:
>>> print tree

Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)

... OK, let’s do it properly now:
>>> Phylo.draw(tree)
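For readers curious about what a Newick string encodes, the Python 3 sketch below parses the same tree into nested tuples and sums branch lengths from the root to each tip. It is a deliberately small parser (names, branch lengths and nesting only, no quoting or comments) and is not how Bio.Phylo is implemented; `parse_newick` and `tip_depths` are hypothetical names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def clade():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                          # skip "("
            children.append(clade())
            while pos < len(text) and text[pos] == ",":
                pos += 1                      # skip ","
                children.append(clade())
            pos += 1                          # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0                          # missing lengths default to 0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return clade()


def tip_depths(node, depth=0.0):
    """Return {tip name: summed branch length from the root}."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
print(tip_depths(tree))
```

In this example every tip ends up at the same distance from the root, which is why the drawn tree has all of its leaves lined up.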

Page 44: Biopython programming workshop at UGA

Modify the tree

Check the tree object for its methods:
>>> help(tree)

Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)

Page 45: Biopython programming workshop at UGA

External applications

Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:

Bio.Phylo.PAML: PAML wrappers & helpers

Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)

Bio.Emboss.Applications: other tools ported via Embassy, including Phylip

Page 46: Biopython programming workshop at UGA

Protein structures

Page 47: Biopython programming workshop at UGA

Going 3D: The PDB module

Load a structure:

>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')

Inspect the object hierarchy:

>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model['E']
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)
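The indexing pattern used here (structure, then model, then chain, then residue, then atom) can be mimicked with nested dictionaries. This Python 3 sketch is only an analogy for the containment hierarchy: all coordinates are invented, and the real Bio.PDB classes carry far more information and behaviour than plain dicts.

```python
# Structure -> Model -> Chain -> Residue -> Atom, as plain nested dicts.
# All coordinates below are invented for illustration.
structure = {
    0: {                                   # model 0
        "E": {                             # chain E
            15: {                          # residue number 15
                "N": (1.0, 2.0, 3.0),      # atom name -> (x, y, z)
                "CA": (1.5, 2.5, 3.5),
            },
            16: {
                "N": (4.0, 5.0, 6.0),
                "CA": (4.5, 5.5, 6.5),
            },
        },
    },
}

model = structure[0]
chain = model["E"]
residue = chain[15]
print(sorted(residue))                     # atom names in this residue


def iter_atoms(struct):
    """Yield (model, chain, residue, atom, xyz) for every atom in the hierarchy."""
    for model_id, chains in struct.items():
        for chain_id, residues in chains.items():
            for res_id, atoms in residues.items():
                for atom_name, xyz in atoms.items():
                    yield model_id, chain_id, res_id, atom_name, xyz


print(len(list(iter_atoms(structure))))
```

Each level is a container for the level below it, which is why calling `list()` on a structure, model, chain or residue in the session above reveals its children.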

Page 49: Biopython programming workshop at UGA

Figure: The “SMCRA” object hierarchy

Page 50: Biopython programming workshop at UGA

Extracting a peptide sequence

Get the amino acid sequence through a Polypeptide object:

>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()

Page 51: Biopython programming workshop at UGA

Calculating RMSD

Given two aligned structures, filter a list of target residues for high RMS deviation.

Input:
    list of residue positions (integers)
    two equivalent chains from aligned protein models (residue numbers must match)
    minimum RMSD value (float)

Output: list of residue positions, filtered

Procedure:
    1 Extract coordinates of Cα atoms
    2 If available (not glycine), extract Cβ coordinates, too
    3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
    4 Compare to the given RMSD threshold

Page 52: Biopython programming workshop at UGA

from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, thresh=0.5):
    super = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        super.set(array(coord1), array(coord2))
        rmsd = super.get_init_rms()
        if rmsd >= thresh:
            yield res
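The quantity being thresholded is a root-mean-square deviation between matched atom positions. The Python 3 sketch below computes it with the standard library only, assuming the two models are already aligned so that no further superposition is applied (my reading of `get_init_rms`); the coordinates are invented, and `rmsd` and `filter_by_rmsd` are hypothetical helpers.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


def filter_by_rmsd(pairs, threshold=0.5):
    """Yield keys of {key: (coords1, coords2)} whose RMSD reaches the threshold."""
    for key in sorted(pairs):
        if rmsd(*pairs[key]) >= threshold:
            yield key


# Invented CA/CB coordinates for two residues in two already-aligned models.
pairs = {
    10: ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]),
    11: ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]),
}
print(list(filter_by_rmsd(pairs)))
```

Residue 10 barely moves between the two models and is filtered out, while residue 11 is displaced by several units and is kept, which is the behaviour the procedure on the previous slide asks for.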

Page 53: Biopython programming workshop at UGA

Figure: Superimposed structures, with selected deviating residues

Page 54: Biopython programming workshop at UGA

Further reading

Biopython tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

Biopython wiki:
http://biopython.org/

This presentation:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga

Page 55: Biopython programming workshop at UGA

Thanks

’Preciate it.

Gracias
