10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By...
-
Upload
cecil-garrett -
Category
Documents
-
view
221 -
download
1
Transcript of 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By...
10/19/2015 BCHB524 - 2015
Protein StructureInformatics
using Bio.PDB
BCHB5242015
Lecture 12
By Edwards & Li
10/19/2015 BCHB524 - 2015
Outline
Review Python modules Biopython Sequence modules
Biopython’s Bio.PDB Protein structure primer / PyMOL PDB file parsing PDB data navigation: SMCRA Examples
10/19/2015 BCHB524 - 2015
Python Modules Review
Access the program environment sys, os, os.path
Specialized functions math, random
Access file-like resources as files: zipfile, gzip, urllib
Make specialized formats into “lists” and “dictionaries” csv (, XML, …)
10/19/2015 BCHB524 - 2015
BioPython Sequence Modules
Provide “sequence” abstraction More powerful than a python string
Knows its alphabet! Basic tasks already available
Easy parsing of (many) downloadable sequence database formats FASTA, Genbank, SwissProt/UniProt, etc… Simplify access to large collections of sequence Access by iteration, get sequence and accession Other content available as lists and dictionaries. Little semantic extraction or interpretation
Biopython Bio.SeqIO
Access to additional information annotations dictionary features list
Information, keys, and keywords vary with database! Semantic content extraction (still) up to you!
10/19/2015 BCHB524 - 2015
import Bio.SeqIOimport sysseqfile = open(sys.argv[1])for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"): print "\n------NEW SEQRECORD------\n" print "seq_record.annotations\n\t",seq_record.annotations print "seq_record.features\n\t",seq_record.features print "seq_record.dbxrefs\n\t",seq_record.dbxrefs print "seq_record.format('fasta')\n",seq_record.format('fasta')seqfile.close()
10/19/2015 BCHB524 - 2015
Proteins are…
…a linear sequence of amino-acids, after transcription from DNA, and translation from mRNA.
10/19/2015 BCHB524 - 2015
Proteins are…
…3-D molecules that interact with other (biological) molecules to carry out biological functions…
DNA Polymerase Hemoglobin
Proteins are…
www.uniprot.org
Primary Secondary Tertiary Quaternary
Introduction to Protein Structure, by Carl Branden and John Tooze
Proteins are Composed of 20 Amino Acids
Representations of Protein 3D Structures
Ball-Stick Ribbon Surface
Ball-Stick: Atom-Bond
Helix, Sheet and Random Coil
Solvent Accessible Surface
10/19/2015 BCHB524 - 2015
Protein Data Bank (PDB)
Repository of the 3-D conformation(s) / structure of proteins.
The result of laborious and expensive experiments using X-ray crystallography and/or nuclear magnetic resonance (NMR). (x,y,z) position of every atom of every amino-acid
Some entries contain multi-protein complexes, small-molecule ligands, docked epitopes and antibody-antigen complexes…
10/19/2015 BCHB524 - 2015
Visualization (PyMOL)
Molecular Visualization Tools PyMol: http://pymol.sourceforge.net/ Chimera: http://www.cgl.ucsf.edu/chimera/ PMV: http://www.scripps.edu/~sanner/python/ Coot: http://www.ysbl.york.ac.uk/~emsley/coot/ CCP4mg: http://www.ysbl.york.ac.uk/~lizp/molgraphics.html mmLib: http://pymmlib.sourceforge.net/ VMD: http://www.ks.uiuc.edu/Research/vmd/ MMTK: http://starship.python.net/crew/hinsen/MMTK/
http://biopython.org/
Overall Layout of a Structure Object
A structure consists of models
A model consists of chains
A chain consists of residues
A residue consists of atoms
http://biopython.org/
10/19/2015 BCHB524 - 2015
Biopython Bio.PDB
Parser for PDB format files Navigate structure and answer atom-atom
distance/angle questions.
Structure (PDB File) >> Model >> Chain >> Residue >> Atom >> (x,y,z) coordinates SMCRA representation mirrors PDB format
10/19/2015 BCHB524 - 2015
SMCRA Data-Model
Each PDB file represents one “structure” Each structure may contain many models
In most cases there is only one model, index 0. Each polypeptide (amino-acid sequence) is a
“chain”. A single-protein structure has one chain, “A” 1HPV is a dimer and has chains “A” and “B”.
10/19/2015 BCHB524 - 2015
SMCRA Data-Model
#eg1.pyimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]
# This structure is a dimer with two chainsachain = model['A']bchain = model['B']
10/19/2015 BCHB524 - 2015
SMCRA
Chains are composed of amino-acid residues Access by iteration, or by index Residue “index” may not be sequence position
Residues are composed of atoms: Access by iteration or by atom name …except for H!
Water molecules are also represented as atoms – HOH residue name, het=“W”
10/19/2015 BCHB524 - 2015
SMCRA Data-Model
#eg2.pyimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]
for chain in model: for residue in chain: for atom in residue: print chain, residue, atom, atom.get_coord()
10/19/2015 BCHB524 - 2015
Polypeptide molecules
S-G-Y-A-L
10/19/2015 BCHB524 - 2015
SMCRA Atom names
10/19/2015 BCHB524 - 2015
Check polypeptide backbone#eg3.pyimport Bio.PDB.PDBParserimport sys# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]achain = model['A']
for residue in achain: index = residue.get_id()[1] calpha = residue['CA'] carbon = residue['C'] nitrogen = residue['N'] oxygen = residue['O']
print "Residue:",residue.get_resname(),index print "N - Ca",(nitrogen - calpha) print "Ca - C ",(calpha - carbon) print "C - O ",(carbon - oxygen) print break
10/19/2015 BCHB524 - 2015
Check polypeptide backbone#eg4.py# As before...
for residue in achain: index = residue.get_id()[1] calpha = residue['CA'] carbon = residue['C'] nitrogen = residue['N'] oxygen = residue['O']
print "Residue:",residue.get_resname(),index print "N - Ca",(nitrogen - calpha) print "Ca - C ",(calpha - carbon) print "C - O ",(carbon - oxygen)
if achain.has_id(index+1): nextresidue = achain[index+1] nextnitrogen = nextresidue['N'] print "C - N ",(carbon - nextnitrogen)
10/19/2015 BCHB524 - 2015
Find potential disulfide bonds
The sulfur atoms of Cys amino-acids often form “di-sulfide” bonds if they are close enough – less than 8 Å.
Compare with PDB file contents: SSBOND
Bio.PDB does not provide an easy way to access the SSBOND annotations
10/19/2015 BCHB524 - 2015
Find potential disulfide bonds#eg5.pyimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1KCW", "1KCW.pdb")model = structure[0]achain = model['A']
cysresidues = []for residue in achain: if residue.get_resname() == 'CYS': cysresidues.append(residue)
for c1 in cysresidues: c1index = c1.get_id()[1] for c2 in cysresidues: c2index = c2.get_id()[1] if (c1['SG'] - c2['SG']) < 8.0: print "possible di-sulfide bond:", print "Cys",c1index,"-", print "Cys",c2index, print round(c1['SG'] - c2['SG'],2)
10/19/2015 BCHB524 - 2015
Find contact residues in a dimer#eg6.pyimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV","1HPV.pdb")
achain = structure[0]['A']bchain = structure[0]['B']
for res1 in achain: if res1.get_id()[0][0] is 'H' or res1.get_id()[0][0] is 'W': continue r1ca = res1['CA'] r1ind = res1.get_id()[1] r1sym = res1.get_resname() for res2 in bchain: if res2.get_id()[0][0] is 'H' or res2.get_id()[0][0] is 'W': continue r2ca = res2['CA'] r2ind = res2.get_id()[1] r2sym = res2.get_resname() if (r1ca - r2ca) < 6.0: print "Residues",r1sym,r1ind,"in chain A", print "and",r2sym,r2ind,"in chain B", print "are close to each other:",round(r1ca-r2ca,2)
10/19/2015 BCHB524 - 2015
Find contact residues in a dimer – better version#eg7.pyimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV","1HPV.pdb")
achain = structure[0]['A']bchain = structure[0]['B']
bchainca = [ r['CA'] for r in bchain ]
neighbors = Bio.PDB.NeighborSearch(bchainca)
for res1 in achain: r1ca = res1['CA'] r1ind = res1.get_id()[1] r1sym = res1.get_resname() for r2ca in neighbors.search(r1ca.get_coord(), 6.0): res2 = r2ca.get_parent() r2ind = res2.get_id()[1] r2sym = res2.get_resname() print "Residues",r1sym,r1ind,"in chain A", print "and",r2sym,r2ind,"in chain B", print "are close to each other:",round(r1ca-r2ca,2)
10/19/2015 BCHB524 - 2015
Superimpose two structuresimport Bio.PDBimport Bio.PDB.PDBParserimport sys
# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure1 = parser.get_structure("2WFJ","2WFJ.pdb")structure2 = parser.get_structure("2GW2","2GW2a.pdb")
ppb=Bio.PDB.PPBuilder()
# Manually figure out how the query and subject peptides correspond...# query has an extra residue at the front# subject has two extra residues at the backquery = ppb.build_peptides(structure1)[0][1:]target = ppb.build_peptides(structure2)[0][:-2]
query_atoms = [ r['CA'] for r in query ]target_atoms = [ r['CA'] for r in target ]
superimposer = Bio.PDB.Superimposer()superimposer.set_atoms(query_atoms, target_atoms)
print "Query and subject superimposed, RMS:", superimposer.rmssuperimposer.apply(structure2.get_atoms())
# Write modified structures to one fileoutfile=open("2GW2-modified.pdb", "w") io=Bio.PDB.PDBIO() io.set_structure(structure2) io.save(outfile) outfile.close()
10/19/2015 BCHB524 - 2015
Superimpose two chainsimport Bio.PDB
parser = Bio.PDB.PDBParser(QUIET=1)structure = parser.get_structure("1HPV","1HPV.pdb")
model = structure[0]
ppb=Bio.PDB.PPBuilder()
# Get the polypeptide chainsachain,bchain = ppb.build_peptides(model)
aatoms = [ r['CA'] for r in achain ]batoms = [ r['CA'] for r in bchain ]
superimposer = Bio.PDB.Superimposer()superimposer.set_atoms(aatoms, batoms)
print "Query and subject superimposed, RMS:", superimposer.rmssuperimposer.apply(model['B'].get_atoms())
# Write structure to fileoutfile=open("1HPV-modified.pdb", "w") io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(outfile) outfile.close()
10/19/2015 BCHB524 - 2015
Exercises
Read through and try the examples from Chapter 10 of the Biopython Tutorial and the Bio.PDB FAQ.
Write a program that analyzes a PDB file (filename provided on the command-line!) to find pairs of lysine residues that might be linked if the BS3 cross-linker is used. The rigid BS3 cross-linker is approximately 11 Å long. Write two versions, one that computes the distance between
all pairs of lysine residues, and one that uses the NeighborSearch technique.