Chem 731 – Computer methods for studying protein structure and function

These are the course notes for the ongoing graduate course. I will post updated versions as I go along. These course notes will contain the slides I show in class, as well as additional notes and explanations.

Please let me know if something is unclear, so that I can improve these notes.

This version is from Wednesday 4th December, 2013.




Contents

1 Introduction

1.1 Overview
1.1.1 Before we begin. . .
1.1.2 Course topics

1.2 Linux
1.2.1 What is Linux?
1.2.2 Why use Linux, if it is a pain in the behind?
1.2.3 Recommended Linux distributions
1.2.4 Free Software . . .
1.2.5 Web resources for Linux
1.2.6 Let’s install Linux, if we haven’t yet

1.3 Using the shell

1.4 The bash shell
1.4.1 Some basic shell commands
1.4.2 Becoming the super user
1.4.3 Installing some software
1.4.4 Fortune cookies
1.4.5 Throwing fortune cookies into black holes
1.4.6 Saving fortune cookies for posterity
1.4.7 Saving more fortune cookies for posterity
1.4.8 Our own cookie factory
1.4.9 A fancier cookie factory
1.4.10 Viewing documentation with man
1.4.11 Searching documentation with apropos

2 LaTeX

2.1 Prerequisites

2.2 Overview
2.2.1 What is LaTeX?
2.2.2 Example LaTeX markup (source code of slide above)


2.2.3 “Logical markup”: Separating content from presentation

2.3 Examples and exercises
2.3.1 The source file for this presentation
2.3.2 Should I use LaTeX or a word processor?
2.3.3 Exercise 1: Create a LaTeX document
2.3.4 Exercise 1 ctd.
2.3.5 Exercise 2: Use the UW thesis template

3 Gnuplot

3.1 Installation

3.2 Introduction

3.3 Plotting functions and files
3.3.1 Start Gnuplot
3.3.2 Running a gnuplot script file
3.3.3 Saving a plot to file
3.3.4 Including a plot in LaTeX
3.3.5 Plotting data files
3.3.6 Working with data files in different formats
3.3.7 Plotting CSV files; multiple data sets

3.4 Curve fitting with Gnuplot
3.4.1 Theories without adjustable parameters
3.4.2 Theories with adjustable parameters
3.4.3 Numerical curve fitting by gradient descent
3.4.4 Example: Receptor activation by ligand
3.4.5 The 5-HT2B receptor can be up- and down-regulated by ligands
3.4.6 Receptor activation or inhibition by ligand – theory
3.4.7 How many variable parameters should we use?
3.4.8 How good is the fit?
3.4.9 Testing exact theories with inexact data
3.4.10 Testing a theory with adjustable parameters
3.4.11 Evaluating the fit error: χ²
3.4.12 How do we obtain the standard deviations of the measured values?
3.4.13 A practical exercise: Calcium binding to daptomycin
3.4.14 What daptomycin is supposed to do
3.4.15 One or more types of binding sites for calcium?
3.4.16 Daptomycin fluorescence after addition of EDTA at t = 0
3.4.17 A single-exponential model
3.4.18 Fitting with 1 to 4 exponential terms
3.4.19 Where are the parameters obtained from the fit?
3.4.20 Which fit is the best?
3.4.21 Plotting the fit residuals
3.4.22 Residuals from a good fit (4 exponentials)
3.4.23 Residuals from a poor fit (2 exponentials)
3.4.24 So have we found the truth?

3.5 Code and data listings


4 Protein structure visualization with Jmol and Pymol

4.1 Introduction
4.1.1 Why X-rays?
4.1.2 Is it easy?
4.1.3 Protein structure databases
4.1.4 Protein structure family relations
4.1.5 The PDB data format
4.1.6 Software for molecular visualization

4.2 Jmol
4.2.1 Jmol exercises
4.2.2 The PDB file
4.2.3 The fields of the ATOM record
4.2.4 A hetero-atom record
4.2.5 Tweaking the view
4.2.6 Saving our hard work
4.2.7 Saving the current state
4.2.8 Saving images
4.2.9 Looking at protein folds
4.2.10 Folds
4.2.11 More on selections
4.2.12 Exercise: Try to reproduce this display
4.2.13 Hints
4.2.14 And another one
4.2.15 And a last one

4.3 Pymol
4.3.1 Documentation
4.3.2 The GUI
4.3.3 Opening files
4.3.4 Working with single structures
4.3.5 Exercise: HIV protease with the inhibitor saquinavir bound to it
4.3.6 What are virus proteases, anyway?
4.3.7 Saving a cleaned-up version of the molecule
4.3.8 Visualizing structure elements
4.3.9 Saving state
4.3.10 Selections
4.3.11 Prettyfication
4.3.12 Producing high-quality figures
4.3.13 Driving Pymol with scripts
4.3.14 The image produced by gyrase.pml
4.3.15 What is DNA topoisomerase anyway?
4.3.16 The reaction catalyzed by DNA topoisomerases
4.3.17 Understanding script files


5 Sequence analysis

5.1 Introduction
5.1.1 Sequence analysis resources: Starting points

5.2 Exercises
5.2.1 Proteins of unknown function in the Saccharomyces cerevisiae (baker’s yeast) genome
5.2.2 Sequence composition and inferred properties
5.2.3 Secondary structure prediction
5.2.4 Secondary structure prediction ctd.
5.2.5 Searching for sequence motifs
5.2.6 Sequence motifs are expressed as consensus motifs
5.2.7 How do we find motifs?
5.2.8 Searching sequence motifs
5.2.9 The CAAX box motif
5.2.10 Comparing sequences
5.2.11 Aligning sequences

6 Molecular docking

6.1 Introduction
6.1.1 Overview of the procedure

6.2 Exercise: Docking imatinib to abl protein tyrosine kinase
6.2.1 Preparing the receptor input file
6.2.2 Preparing the ligand input file
6.2.3 Defining the search area
6.2.4 Create the Vina configuration file
6.2.5 Run Vina
6.2.6 Inspect the results in Pymol

7 Python programming

7.1 Introduction
7.1.1 Python vs. Gnuplot or LaTeX
7.1.2 Is programming easy?
7.1.3 Why Python?
7.1.4 How Python programs are created and executed

7.2 First steps
7.2.1 Python’s interactive mode
7.2.2 Naming pieces of data: Variables

7.3 Keywords and builtins
7.3.1 Some names are special
7.3.2 Python keywords
7.3.3 Python built-ins whose names are not protected

7.4 Data types


7.5 Working with more data: Containers
7.5.1 Lists
7.5.2 How variables work with mutable objects such as lists
7.5.3 Testing for identity and equality
7.5.4 List slicing
7.5.5 Tuples
7.5.6 Sets
7.5.7 Dictionaries
7.5.8 Tuples vs. lists as dictionary keys

7.6 Repeated execution: Loops
7.6.1 Iterating over a dictionary
7.6.2 Iterating over strings
7.6.3 More fun with strings
7.6.4 Exercise: Translating DNA to protein

7.7 List comprehensions
7.7.1 Exercise: Use a list comprehension to translate a DNA sequence

7.8 Nested containers and loops
7.8.1 Nested containers
7.8.2 Nested loops
7.8.3 Rewrite this using list comprehensions?

7.9 Conditional execution
7.9.1 Conditional execution inside a loop

7.10 Boolean evaluation of expressions
7.10.1 Alternative formulation of conditionals
7.10.2 Exercise: What about 6?

7.11 Functions
7.11.1 Defining functions
7.11.2 Functions with default arguments
7.11.3 Exercise: Generating random passwords

7.12 Importing code
7.12.1 Importing self-written code

7.13 Exceptions
7.13.1 Catching exceptions

7.14 Reading and writing files
7.14.1 Writing files
7.14.2 Files and functions: Exercise


Listings

3.1 Gnuplot script to fit activation of dopamine receptors by aripiprazole
3.2 Datafile for Gnuplot script in listing 3.1
3.3 Gnuplot script to fit and plot dopamine receptor activation by aripiprazole
3.4 Gnuplot script to fit up- and down-regulation of serotonin receptors
3.5 The data file for listing 3.4
3.6 Gnuplot script that fits single- to quadruple-exponential decays to the daptomycin EDTA dissociation kinetics
4.1 The gyrase.pml script for Pymol


Chapter 1

Introduction

1.1 Overview

1.1.1 Before we begin. . .

1. This class is experimental – I have never taught it, or even anything like it, before. . . Update: this is actually the second time, but the following still applies
2. There will be a certain degree of chaos
3. There will very likely be things that I forget to explain. If you lose the plot, please tell me. Please feel free to ask anything, at any time; you will be helping me and each other in this way
4. Course website: http://watcut.uwaterloo.ca/chem731/

1.1.2 Course topics

1. Linux
2. LaTeX
3. Data evaluation and presentation with Gnuplot
4. Molecular visualization with Jmol and Pymol
5. Sequence analysis programs
6. Molecular docking
7. Programming in Python


1.2 Linux

1.2.1 What is Linux?

1. Re-implementation of Unix, started as a hobby but now developed by both volunteers and companies such as IBM and Novell, so no longer a toy
2. Used more commonly on servers but also usable as a desktop environment
3. Open source, meaning that everyone can download the source code, modify it, and redistribute their modified code
4. Patchwork architecture: Multiple versions of everything, including graphical user interfaces
5. Can be a pain in the behind to get running and troubleshoot
6. Many different variations (“distributions”) – have a look at distrowatch.com if you are interested.

1.2.2 Why use Linux, if it is a pain in the behind?

1. Many scientific programs were originally developed for Unix workstations and therefore usually also run on Linux
2. Because of its heritage, Linux is a good learning environment for Unix – some people may end up having to work with Unix workstations
3. No automatic “you forgot to update Adobe BlahBlahBlah” warnings. Update: Linux is catching up – now shows you nuisance messages aplenty by default, too
4. No viruses – I have never had anything in some six years of daily use, despite not running any virus protection software (then again, I don’t go to ripped movie sites a lot)
5. To scare away the amateurs
6. Some people like a pain in the behind. . .

1.2.3 Recommended Linux distributions

Debian Linux or one of its derivatives. Recommended flavours:

1. Debian itself – a little more involved to set up, but provides a very clean and stable system
2. Mepis – easier install and slightly better hardware recognition, but no longer leading in this regard
3. Ubuntu – very good hardware recognition and configuration, focus on user friendliness. My impression is that it contains more bugs than Debian.
4. Linux Mint – based on Ubuntu, with more bells and whistles pre-installed

Debian and Ubuntu have an excellent software packaging system that greatly facilitates the installation and configuration of complex programs, including scientific software. All this packaged software is freely available.

1.2.4 Free Software . . .

Good:


• Free – can’t argue with the price
• In most cases, source code free as well – can be used, modified and re-used by others

Bad:

• Free – no paycheck for developers, many programs developed as a hobby
• Quality varies from excellent to horrible, some programs are not maintained – scientific software often developed as part of scientists’ day jobs though, generally of good quality

1.2.5 Web resources for Linux

1. ubuntu.org, mepis.org, debian.org
2. The Linux documentation project: tldp.org/guides.html – lots of documentation relevant for any Linux, has a section dedicated to Debian, too

1.2.6 Let’s install Linux, if we haven’t yet

1. Make sure you prepare for the worst – back up all data you care about
2. Put your CD into the drive and reboot
3. Cross your fingers and knock on wood
4. Create root and swap partitions as required: Root ≥ 10 GB, swap 0.5–1 GB
5. Install grub to the master boot record – lets you switch between Windows and Linux during boot

After the install: Try your internet connection; if it doesn’t work, try to fix it.

If you plan on using Linux in the long term, it may be better to create additional partitions. My setup usually looks similar to this:

• Two system partitions to hold Linux installations (6–8 GB per partition is enough)
• One large partition that holds my data. I do not make this my home directory, because the home directory contains all kinds of hidden files with settings, and if I use the same home directory from different Linux installs, these are going to stomp on each other’s feet, overwriting each other’s settings.
• A swap partition, with 1–2 times the size of the RAM. May not be needed if you have 3–4 GB of RAM or more.

1.3 Using the shell

In the olden days, when you powered up a computer, it would land you in the shell, which on PCs was called MS-DOS. The computer would wait for you to type a command and then press Enter. It would then execute this command and dump you back into the shell, waiting for the next command. Unix machines traditionally operated the same way, as did the original Apple computers (pre-Macintosh), which used some dialect of Basic as the shell language.


MS-DOS, or any other shell, understands a limited number of built-in commands. In order to achieve anything useful, several commands usually have to be executed in series. So that you don’t have to enter the same sequences of commands over and over again, you can write the commands into a text file instead and save this file under a suitable name. You can then simply enter the name of this script or batch file, just like any built-in command, and the shell will execute all the commands it reads from this text file, all in one pass. Therefore, by writing your file, you can craft a new command from a set of existing ones – which is the essence of what we call programming.¹

When the first Macs appeared with easy-to-use graphical user interfaces (GUIs), they found a mixed reception; both positive and negative reactions had their valid reasons. On the one hand, a program with a well-designed GUI makes it much easier to pick up its basic operation intuitively. On the other hand, GUIs tend to get cumbersome with more advanced program usage. For this reason, some programs combine a GUI with shell-style operation and scripting. We will see examples in this course.²

While GUI-driven programs may be easier to use, shell-driven ones are easier to write. They receive all input at the beginning, and produce all output at the end. In contrast, a GUI program has to continually watch for new user input while also processing the previous input. To avoid such complexities, our own programming exercises in this class will use the shell, and it therefore is necessary for us to learn how to use it.

1.4 The bash shell

On Linux, the most widely used shell is bash, based on the older Bourne shell, from which it derives its name (bash = “Bourne Again Shell”). As stated above, it works similarly to MS-DOS, but it has a more powerful and versatile set of commands. It also has a pretty hostile syntax, so that using it for advanced tasks is not fun. However, basic usage is easy, and for anything advanced it is pretty easy to substitute it with something more readable and pleasant such as Python; we will soon see how that is done.

We can use the bash shell from within our GUI by opening a console window from the menu (the exact location will vary with your system).

1.4.1 Some basic shell commands

Try to bring up a console window from your menu. You need to hunt for it – the exact location in the menu will vary with the distribution that you have installed.
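Once the console window is open, a first session might look like the following sketch. The commands themselves are standard, but this particular selection and the directory name chem731_demo are assumptions made for illustration, not necessarily the list shown in class.

```shell
# Work inside a scratch directory so nothing in your home directory is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

pwd                      # print the directory we are currently in
mkdir chem731_demo       # create a new, empty directory (hypothetical name)
ls                       # list the contents of the current directory
cd chem731_demo          # change into the new directory
echo "hello" > note.txt  # create a small text file
cat note.txt             # print the file's contents
cd ..                    # go back up one level
rm -r chem731_demo       # delete the directory and everything in it
```

Be careful with rm: the shell has no recycle bin, so whatever is removed this way is gone for good.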

¹ Another widely used form of programs that often are not appreciated as such are spreadsheets. Each time you enter a formula into a spreadsheet, you are in fact programming – chances are, therefore, that you have already successfully written your own programs.

² This is not limited to Linux or Unix. Microsoft Office has its own Basic dialect built in that lets you program add-ins to extend its functionality. An example add-in for Excel is SpectraAnalysis.xla (see http://www.science.uwaterloo.ca/~mpalmer/software.html).


1.4.2 Becoming the super user

On Debian or Mepis, type:

su

and press <enter>. This will prompt you for the root password that you entered during installation. On Ubuntu or Mint, type

sudo su

When prompted for a password, type your normal user password – there is no separate root password in this case. If you like, you can create a separate root password, while in super user mode, with

passwd

Then type your chosen root password.

System administration, including software installation, requires super user or root privileges. On multi-user systems, such tasks are reserved to the systems administrator. On your own laptop, that is you, and there may not be a need for a separate root account. Ubuntu and its derivatives (Mint) have done away with it, relying instead on the sudo command to perform administrative tasks. On these systems, you can either prefix each single administrative command with sudo, or you can become super user for the session with sudo su.
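If you lose track of which mode a console window is in, you can ask for the numeric user ID; root always has the ID 0. The following is only a small sketch, and the variable names and the wording of the message are made up for illustration.

```shell
# id -u prints the numeric ID of the current user; root always has ID 0.
uid=$(id -u)
if [ "$uid" -eq 0 ]; then
    mode="super user"
else
    mode="ordinary user"
fi
echo "You are working as: $mode"
```

Run it once as yourself and once after su or sudo su to see the difference.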

1.4.3 Installing some software

Software can be downloaded and installed directly from the command line. Let’s try it:

apt-get install fortunes fortune-mod

This will install two software packages. When all is done, issue

exit

to leave the super user mode.

1.4.4 Fortune cookies

Let’s test our new piece of software. Issue

fortune

Try it again. . . see how useful it is?

The shell keeps a command history. You can repeat the last command by hitting the upward arrow. Hitting it twice takes you back to the command before that, and so on.


1.4.5 Throwing fortune cookies into black holes

If you don’t want to see the fortune cookie, you can type:

fortune > /dev/null

Even more useful. /dev/null is the system’s black hole device. If you don’t want to see the output of some command, you can redirect it into the black hole as illustrated here.
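The same trick works for any command, not just fortune. One detail worth knowing is that error messages travel on a separate channel, which is redirected with 2> rather than >. A minimal sketch, in which the directory name is made up and assumed not to exist:

```shell
# Normal output is discarded: nothing appears on the console.
echo "you will never see this" > /dev/null

# Error messages use a separate channel, which is redirected with 2>.
# /no/such/directory is a made-up name that is assumed not to exist.
ls /no/such/directory 2> /dev/null || echo "ls failed, but its complaint went into the black hole"
```

The part after || runs only if the command before it failed, so the last line confirms that ls did run and fail even though its error message was swallowed.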

1.4.6 Saving fortune cookies for posterity

Instead of the black hole, we can also redirect the output to a file and so preserve it. Issue

fortune > wisdom

Then issue

ls

You will see a new file named wisdom, which contains your fortune cookie. Now issue

cat wisdom

to have it printed to your console window.

1.4.7 Saving more fortune cookies for posterity

If you repeat the steps above, each new cookie will overwrite the previous one. Now issue:

fortune >> wisdom; printf "\n" >> wisdom

and repeat these two commands a couple of times (using the up-arrow for convenience). Now, the output of the fortune command got appended to the file instead; the printf command served only to insert empty lines between the cookies.
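The difference between > and >> is easiest to see with output that does not change from one run to the next. In this sketch, echo stands in for fortune, and a temporary file (an assumption made here so that your wisdom file is left alone) takes the place of wisdom:

```shell
# Use a temporary file so that an existing "wisdom" file is left alone.
demo_file=$(mktemp)

echo "first cookie"  > "$demo_file"   # > creates or overwrites the file
echo "second cookie" > "$demo_file"   # overwrites: the first line is gone
echo "third cookie" >> "$demo_file"   # >> appends to the end instead

cat "$demo_file"                      # shows the second and third cookie only
```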

1.4.8 Our own cookie factory

Issue the command

nano

This should open a text editor called “nano” within your console window. If it doesn’t, become root (su) and issue

apt-get install nano

Then type exit to become yourself again, and bring up nano.

Type the following:


#!/bin/bash
cat /dev/null > wisdom # black hole > file
for i in $(seq 1 1 5)
do
    echo "Cookie no. $i" >> wisdom
    echo "------------" >> wisdom
    fortune >> wisdom
    echo >> wisdom
done

Now press ctrl-o, give the file name “cookiefactory”, press enter, and then ctrl-x to exit.
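The two constructs that do the work in this script can also be tried directly at the prompt. In the sketch below, the loop only prints the cookie numbers, so it runs even where fortune is not installed:

```shell
# seq FIRST STEP LAST prints a series of numbers, one per line.
seq 1 1 3

# $( ... ) runs a command and pastes its output into the command line,
# so the loop below is equivalent to "for i in 1 2 3".
for i in $(seq 1 1 3)
do
    echo "Cookie no. $i"
done
```

Everything between do and done is executed once for each value, with the variable i holding the current one.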

Confirm that your new file exists by issuing

ls -l

Now issue the command

cookiefactory

What happened?

Let’s try

./cookiefactory

This time, it denies permission, which is progress – at least it found the file. Let’s fix that:

chmod +x cookiefactory

makes the file executable. Now

./cookiefactory

should work.
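As for what happened: the first attempt fails because the shell looks for commands only in the directories listed in the PATH variable, and the current directory is normally not among them. The second attempt fails because a freshly saved text file does not carry the execute permission. Both points can be checked with a throwaway script; the file name hello and the scratch directory are made up for this sketch.

```shell
# Build a tiny script in a scratch directory (names are hypothetical).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '#!/bin/bash\necho "it works"\n' > hello

echo "$PATH"     # the directories the shell searches for commands
ls -l hello      # permissions typically start with -rw: no x, so not executable

chmod +x hello   # add the x (execute) permission
ls -l hello      # now the permissions contain x
./hello          # ./ means "the file named hello in the current directory"
```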

1.4.9 A fancier cookie factory

Open up the file again:

nano cookiefactory

Change it like this:

#!/bin/bash
for i in $(seq 1 1 $2)
do
    echo "Cookie no. $i" >> $1
    echo "------------" >> $1
    fortune >> $1
    echo >> $1
done

Save with ctrl-o and quit nano with ctrl-x. Use it like

./cookiefactory w1 5

to write 5 cookies to a file named w1.

1.4.10 Viewing documentation with man

The man command lets you access documentation (so-called man-pages) for the various programs and commands. For example, type

man less

to learn everything there is to know about the less command.

1.4.11 Searching documentation with apropos

Let’s say we want to convert something to pdf. How can we find out what programs could help us with that? Type

apropos pdf

The apropos command searches all available man pages for a word or phrase (here pdf). However, the output that it spits at us may be a bit longish. We can filter it with the grep command:

apropos pdf | grep -i convert

Use apropos to find out how to get a list of the fonts available on your system.

Several new things have been introduced here:

1. The | character sets up a pipe – the output of the apropos command is fed as input to the grep command

2. The -i option causes grep to ignore case – both "convert" and "Convert" will be accepted

As for the list of fonts, try:

apropos fonts | grep -i list

That should give you a short list of search results, among which you should find the command fc-list.

We will learn a few more shell commands later in this course. If you want to learn more on your own, have a look at http://tldp.org/LDP/Bash-Beginners-Guide/html/index.html.


Chapter 2

LaTeX

2.1 Prerequisites

We first need to install LaTeX itself, as well as an editor suitable for writing LaTeX documents.

From your package manager, install the LaTeX editor Texmaker. This should automatically also install the essential parts of TexLive, a complete LaTeX installation. If it does not, manually install package texlive as well.

If you search your package manager for TexLive, it will show you a long list of packages. The names of some end in '-recommended'. Install those as well.

Before you do this, it would be a good idea to make sure that your package manager uses the repository at mirror.csclub.uwaterloo.ca. Downloads from there are very quick.

2.2 Overview

2.2.1 What is LaTeX?

1. A programmable typesetting system, based on TeX
2. Good for typesetting mathematics – widely used for publishing books or journals in math and physics
3. Suitable for large, structured documents like reports, papers, books, theses, with or without mathematics
4. Documents contain a mixture of text and formatting instructions (“markup”)
5. Extensible by user – very many special-purpose packages have been programmed
6. Various output formats; in practice, PDF output is usually what we want

Using LaTeX well needs some study. There is a boatload of documentation available. Here are some valuable resources:

• the LaTeX FAQ at http://www.tex.ac.uk/cgi-bin/texfaq2html?introduction=yes

• the “not so short introduction to LaTeX” – should have come with your TexLive installation. Type texdoc lshort into a console to view

• CTAN http://www.ctan.org – a repository of all kinds of LaTeX packages. Most of the mature, widely used packages come with TexLive though.

Notice the texdoc <package> trick used above. For most packages, this should find and display information installed on your system. Try texdoc mhchem to see if it works.

2.2.2 Example LaTeX markup (source code of slide above)

\begin{frame}
\frametitle{What is \LaTeX?}

\begin{enumerate}

\item A programmable typesetting system,
based on \TeX{}

\item Good for typesetting mathematics --
widely used for publishing books or
journals in math and physics

...
\end{enumerate}

\end{frame}

What we can see here is that LaTeX can be used not only for printed documents but also for slides.

We also see that LaTeX uses the concept of logical markup. The key idea behind logical markup is the separation of content and presentation: In the text, we only specify what is a heading, what is a normal paragraph, and so on. Attributes such as font, font weight and size, color etc. are defined elsewhere, and these definitions can easily be applied or replaced with others, without changing the text.

2.2.3 “Logical markup”: Separating content from presentation

[Figure: content with logical markup is turned into the typeset document; an external style file maps logical markup to actual formatting instructions]

It is advisable to use logical markup wherever possible. This has several advantages:

1. Your document will have a consistent look
2. You can easily change the layout, without going through your document again
3. Logical markup tends to be more legible and concise
4. You can use your content in different formats

As an example of the last item above, I produce my printed course notes and slides from the same source files. Huge time saver.

2.3 Examples and exercises

2.3.1 The source file for this presentation

% the beamer document class produces slides
\documentclass[ignorenonframetext, serif]{beamer}

% some customizations reside in this package
\usepackage{beamerslides}

% tell LaTeX where to look for images
\graphicspath{{/data/chem731/images/}}

% here, we include the actual content
\include{latexcontent}

I have a separate file for producing these course notes, which is a bit longer. However, the key point is again the instruction \include{latexcontent}, and similar instructions for the other chapters.

In the xy-content source files, I have one frame environment for each slide, and notes like this one between the frame environments. The beamer class option ignorenonframetext will cause this additional text to be disregarded when creating the slides. For the typeset notes, I use a simple trick to convert the content of the slides to plain text.

This setup makes it easy to keep slides and notes in sync and is actually quite fun to work with.

2.3.2 Should I use LaTeX or a word processor?

LaTeX is good

• with large documents (like a thesis)
• if you are in charge of the layout (thesis)
• if you don’t mind spending some time to learn it

LaTeX is no better than a word processor

• with small documents that don’t need much formatting – however, those may be good for practicing

LaTeX is more trouble than it’s worth

• with paper manuscripts that are going to be typeset by the publisher anyhow (exception: publishers that ask for LaTeX)
• if you must cooperate on the document with someone who refuses to use LaTeX

If you search the web for guidance on this choice, you will find a lot of LaTeX zealots ranting about it, and quite often it is clear that they haven’t touched a word processor in the last 50 years or so. However, there are still valid reasons to prefer LaTeX – it really gives you more flexibility and power than a word processor.

Basic usage of LaTeX, for example for a thesis, does not take too long to learn. The automatic placement of figures and tables alone will probably more than compensate you for the amount of time you need to spend on learning it.

If you decide to stick with a conventional word processor, it is still a good idea to follow the principle of separating the logical structure of the document from the formatting. Both Word and OpenOffice let you do this, although it is a little less obvious how.

2.3.3 Exercise 1: Create a LaTeX document

1. Start Texmaker
2. Select File > New
3. Select Wizard > Quickstart
4. Set papersize to letterpaper
5. Set encoding to utf8x
6. Click OK
7. Save the document as exercise1.tex, preferably in a new folder
8. From the first drop-down menu in the tool bar, select PDFLaTeX, and then click on the blue arrow next to it. This will compile the document.
9. From the second drop-down, select View PDF, and click on that blue arrow. You should now see the compiled document on the screen.

You now have before you the skeleton of a LaTeX document.

2.3.4 Exercise 1 ctd.

• insert \maketitle
• use lipsum package
• add an abstract
• type some text
• type some lists
• insert sectioning commands
• some font formatting commands: bold, italics, font sizes
• super- and subscripts
• create a shortcut for “Pneumonoultramicroscopicsilicovolcanoconiosis”
• customize hyphenation
• load nicer fonts
• adjust margins

These things are illustrated in the file exercise.tex that I will be sending along. Open it with Texmaker, run it through PDFLaTeX, and view the resulting PDF file.

2.3.5 Exercise 2: Use the UW thesis template

• Open the file testthesis.tex that I sent around earlier.
• Compile with PDFLaTeX. Does it work? Let me know if it does not.
• Adjust Texmaker’s Quickbuild command: Options > Configure Texmaker > Quickbuild > User. Into the text field at the bottom, type (all in one line):

pdflatex -interaction=nonstopmode %.tex|bibtex %.aux|pdflatex -interaction=nonstopmode %.tex|pdflatex -interaction=nonstopmode %.tex

• From the first drop-down in the tool bar, select Quick Build and hit the blue arrow next to it.


Chapter 3

Gnuplot

3.1 Installation

Gnuplot can be installed as a package of that name through your friendly package manager. It is a good idea to also get the package gnuplot-doc, which contains a lot of worked examples. The documentation for Gnuplot in PDF format does not seem to be in the package but can be found on Gnuplot’s website.

3.2 Introduction

Gnuplot can

1. plot experimental data
2. plot mathematical functions (y = x^2)
3. plot data and functions together
4. fit function parameters to experimental data
5. plot 3D graphs

If you look around the Gnuplot website, you will see all kinds of fancy, colorful 3D graphics. I haven’t got enough neurons left to appreciate those – the focus here will be on 2D graphics and data fitting.

3.3 Plotting functions and files

3.3.1 Start Gnuplot

• Bring up console
• type gnuplot -V to see your program version; make sure you obtain the documentation that matches your version
• type gnuplot
• type plot (x+3)**2 title 'a parabola'
• close plot window, type ctrl+d

As you can see, Gnuplot is driven from the command line. Typing gnuplot -V tells Gnuplot to simply print its version number and then exit. If you type gnuplot, Gnuplot starts and sits there, waiting for you to tell it what to do, just like a regular shell. You can then interactively plot a function, as we have done.

3.3.2 Running a gnuplot script file

• cd to the folder with the practice files I sent around
• run gnuplot poteffx.plt
• When you are done admiring the graph, click into the window
• If you clicked the close button of the window, Gnuplot hangs; press ctrl+d to exit

For anything advanced, however, you don’t want to use Gnuplot interactively, because it will forget all your hard work once it exits. Instead, you will usually type up all commands in a script file and then let Gnuplot run it.

The script file contains both commands and explanatory comments. The easiest way to learn Gnuplot is by looking at and playing with examples. To take full advantage of it, it is necessary to read the documentation, which is reasonably well written and quite complete, although a bit short on examples. A good website with worked examples is http://t16web.lanl.gov/Kawano/gnuplot/index-e.html.

In this exercise, we again saw an interactive display. It is more useful to save the plot to a file, however.

3.3.3 Saving a plot to file

• Run gnuplot poteff.plt. That should give you a pile of strange-looking text. This is in fact a PostScript description of the plot. PostScript is a document description language that is similar to PDF and can easily be converted to it.
• Run gnuplot poteff.plt > test.eps to send the PostScript to a file.
• Run gv test.eps to admire the fruit of your hard work.
• Run epstopdf test.eps. This will convert the eps file to a pdf file, which we can for example use in LaTeX.
• As a shortcut, run ./gnuplot-pdf poteff.plt
• Convert PDF to png: convert -density 300 poteff.pdf poteff.png

Gnuplot can produce plots in various formats. To this end, it uses a variety of different “terminals”, or output routines. The dirty little secret is that these all have their different settings, abilities and limitations.

The EPS (encapsulated PostScript) terminal is mature and versatile. EPS can easily be converted to other graphics file formats, in particular pdf. For use in LaTeX, you should always use a vector graphics format, which in practice means PDF.

The little gnuplot-pdf script runs gnuplot and epstopdf first and then shows you the result in gv. Once you close gv, you still have the pdf file.

Only where you cannot use PDF, such as on a web page, should you use pixel graphics (PNG). Most word processors still can’t use PDF, so you are stuck with PNG. The convert utility lets you control the resolution of the resulting file. Use a high resolution to make it look good in print, for example convert -density 600 plot.pdf plot.png.

Addendum: I have found that convert occasionally makes some dents into the plot graphs. It seems that pdftoppm works better. This program is part of the package poppler-utils. Type man pdftoppm to find out how to use it.

3.3.4 Including a plot in LaTeX

• Bring up Texmaker
• Load the file poteffplot.tex
• Run it through PDFLaTeX and look at the output.

This is only a quick illustration that Gnuplot and LaTeX go well together. We won’t elaborate further.

3.3.5 Plotting data files

• Run ./gnuplot-kpdf sw17.plt
• Looks nice, too, doesn’t it
• Run less sw17.plt to inspect the file
• Hit q and then run less sw17.dat

This example shows how to plot data from a file. The data are organized in columns separated by one or more spaces. The columns are selected with the using clause; for example, using 1:3 uses the first column as x and the third column as y values.

The plot file is much shorter than the previous one, since most of the settings have now been factored out into a separate file that is simply loaded at the beginning. On my computer, I keep a similar setup file in /gnuplot/setup_eps.plt, and I just load it into new plot files with load '/gnuplot/setup_eps.plt'. This saves me finger strokes and gives my plots a consistent look every time.

Settings that we want to change can still be overridden – for example if we first say set logscale x and later unset logscale x, the second command will take effect.

3.3.6 Working with data files in different formats

• Run less a-chym.dat – note that file contains no x values
• Press q, run gnuplot achym-dat – values are plotted, but from x=0 (wrong)
• Run gnuplot, then plot 'a-chym.dat' using ($0+178):1
• Other transformations could also be applied, for example plot 'a-chym.dat' using (log($0+1)):1 could be used to construct a logarithmic x axis.

3.3.7 Plotting CSV files; multiple data sets

• Widely used low-tech format, easy export and import with spreadsheets
• Just say set datafile separator ','
• File chym.csv is an example – see whether you can get it plotted
• Run ./gnuplot-pdf waldhoer.plt; inspect files

Multiple data sets can reside in one file when separated by two or more empty lines. The index clause selects one or more data sets, counting from 0; for example index 0:0 selects only the first data set, whereas index 1:2 selects the second and third data set (you would not often need multiple selection, though).
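To make the counting rule concrete, here is a small Python sketch of how such a file could be split into data sets. It is only an illustration of the rule described above, not Gnuplot's actual parser: the function names split_data_sets and select_index and the example numbers are made up, and comment lines are not handled.

```python
def split_data_sets(text):
    """Split gnuplot-style data text into data sets.

    A run of two or more blank lines starts a new data set;
    a single blank line does not.
    """
    data_sets = []
    current = []
    blank_run = 0
    for line in text.splitlines():
        if line.strip() == "":
            blank_run += 1
            continue
        if blank_run >= 2 and current:
            data_sets.append(current)
            current = []
        blank_run = 0
        current.append([float(field) for field in line.split()])
    if current:
        data_sets.append(current)
    return data_sets


def select_index(data_sets, first, last):
    """Mimic 'index first:last': both bounds inclusive, counting from 0."""
    return data_sets[first:last + 1]


# Made-up example: three data sets of two rows each
example = "1 10\n2 20\n\n\n1 11\n2 21\n\n\n1 12\n2 22\n"
sets = split_data_sets(example)
print(len(sets))                  # number of data sets found
print(select_index(sets, 1, 2))   # second and third data set
```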

Not illustrated: Intervals can be selected with every, for example every 5 selects every fifth value only.

3.4 Curve fitting with Gnuplot

In addition to plotting functions and data, Gnuplot also lets us fit functions to data sets. Function fitting (or curve fitting) can be done for different purposes:

1. Testing a theoretical model with measurement data – here, we typically need to fit and test alternative models. Example: Single- vs. double-exponential fluorescence decay

2. Obtaining values for the parameters of an accepted model for a given set of experimental data. Example: Time course of drug excretion in a single patient – we assume it’s single exponential and don’t consider any alternatives

3. Creating trend lines to “guide the eye”, using some arbitrary, simple equations that need not have any exact physical meaning (e.g. the Hill equation)

Gnuplot employs the widely used Levenberg-Marquardt numerical fitting algorithm, which can be used to fit arbitrary functions to given data sets. The one important limitation in Gnuplot is that the function must be provided in an explicit form; that is, we cannot use iterative numerical procedures to calculate the value of the function, as may be necessary if for example the function is only defined by a system of differential equations. In that case, we need a real programming language such as Python, which lets us use the Levenberg-Marquardt algorithm together with arbitrary ways of computing function values.


3.4.1 Theories without adjustable parameters

The law of Hagen and Poiseuille describes the volume flow rate of laminar flow in a capillary:

\[ \frac{dV}{dt} = \frac{\pi r^4 \Delta p}{8 \eta l} \]

The answer to the ultimate question of life, the universe, and everything, according to Douglas Adams:

42

Curve fitting requires that some of the parameters of a given function can reasonably be treated as variable. The point of these examples is to show that this is not always the case – in such cases, curve fitting is not applicable.

3.4.2 Theories with adjustable parameters

Hooke’s law: The extension of a spring is proportional to the force applied to it

F = −kx

Michaelis-Menten law of enzyme reaction velocity: The velocity is proportional to the substrate saturation, which in turn follows mass action kinetics

\[ V = \frac{V_{max}[S]}{[S] + K_M} \tag{3.1} \]

In the case of Hooke’s law, k is the variable parameter; it is not universal but is a property of the particular spring under study that must be determined experimentally. Since Hooke’s law is a linear equation, we don’t need numeric fitting but can simply apply linear regression. However, numeric fitting can do the job also, and indeed Gnuplot doesn’t seem to provide built-in linear regression.

The Michaelis-Menten law is not linear, so we cannot directly apply linear regression. We can apply it if we transform the equation into a linear shape; this is the point of the Lineweaver-Burk plot and some other plots that are traditionally taught in enzyme kinetics. However, using numerical fitting, we can evaluate the data directly and can avoid the distortion inherent in the linear transformations.
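For reference, the Lineweaver-Burk transformation mentioned above can be written out as follows. This is the standard textbook derivation, obtained by taking the reciprocal of both sides of equation 3.1; the symbol σ for the measurement error is introduced here only for illustration.

```latex
\[
\frac{1}{V} = \frac{[S] + K_M}{V_{max}[S]}
            = \frac{K_M}{V_{max}} \cdot \frac{1}{[S]} + \frac{1}{V_{max}}
\]
% A plot of 1/V against 1/[S] is a straight line with
% slope K_M / V_max and intercept 1 / V_max.

% Error propagation shows where the distortion comes from:
\[
\sigma_{1/V} \approx \frac{\sigma_V}{V^2}
\]
```

The second expression says that a given measurement error in V is magnified most strongly at small V, that is at low substrate concentrations, so those points dominate a linear regression on the transformed data.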


3.4.3 Numerical curve fitting by gradient descent

[Figure: total error plotted as a surface over parameter 1 and parameter 2; the starting point (arbitrary parameter values) is marked on the surface]

Numerical fitting minimizes the total error as a function of all variable parameters. With two variable parameters, we can envision the error function as a surface with valleys and mountains; the lowest point is at the optimal combination of values for the two parameters. Numerical fitting algorithms like the one devised by Levenberg and Marquardt work by gradient descent, that is by climbing downhill on this error surface.

To get going, we need to provide a starting point, that is supply some arbitrary values for the variable parameters. The algorithm then explores the terrain close by and moves a step in a downhill direction. It repeats this until it can find no lower point, and then stops.

Even without going into the mathematics (which I would have to retrieve from a textbook also) this sketch explains why numerical fitting is general: The algorithm only considers the slope of the error surface, it does not care about the underlying function that erects the surface.

With more than two variable parameters, this visualization no longer works, but the idea is still the same – march “downhill” in an (n + 1)-dimensional space.
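The downhill march can be sketched in a few lines of Python. Note the assumptions: this is plain gradient descent with a fixed step factor, not the Levenberg-Marquardt algorithm that Gnuplot uses; the model is a straight line y = a·x + b, chosen only because its error surface has two parameters like the one pictured above; and the data, the function names and the values of rate and tolerance are made up for illustration.

```python
def total_error(a, b, points):
    """Sum of squared differences between the line a*x + b and the data."""
    return sum((a * x + b - y) ** 2 for x, y in points)


def gradient(a, b, points):
    """Slope of the error surface along the a and b directions."""
    grad_a = sum(2 * (a * x + b - y) * x for x, y in points)
    grad_b = sum(2 * (a * x + b - y) for x, y in points)
    return grad_a, grad_b


def gradient_descent(points, a, b, rate=0.01, tolerance=1e-10, max_steps=100000):
    """Walk downhill from the starting point (a, b) until the surface is flat."""
    for _ in range(max_steps):
        grad_a, grad_b = gradient(a, b, points)
        if grad_a ** 2 + grad_b ** 2 < tolerance ** 2:
            break  # no noticeable slope left: we are at the bottom
        a -= rate * grad_a
        b -= rate * grad_b
    return a, b


# Made-up data lying exactly on y = 2x + 1
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
a_fit, b_fit = gradient_descent(points, a=0.0, b=0.0)
print(a_fit, b_fit, total_error(a_fit, b_fit, points))
```

The step factor matters: if rate is chosen too large for the data at hand, the walk overshoots the valley and diverges. Handling this robustly is one of the things a real fitting algorithm does for us.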

3.4.4 Example: Receptor activation by ligand

Assumptions:

1. Ligand binding to the receptor follows mass action kinetics
2. Receptor is completely inactive without ligand bound, and fully active with ligand bound. That is, receptor activation equals receptor saturation.
3. Ligand binds according to law of mass action: K = [L][R]/[LR]

The degree of receptor activation, as a function of [L], then becomes:

\[ A = \frac{[L]}{[L] + K} \tag{3.2} \]

This is a very simple case, with just one variable parameter (K). Example data, and a Gnuplot file to fit them to the above equation, are listed below. If you run the Gnuplot file in listing 3.1, it will tell you that the best fit to the data is obtained with k = 716.717. It also prints a whole lot more information, some of which we will consider later.
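To see how a starting value for K is turned into a fitted value, here is a minimal Python sketch for equation 3.2. It makes several assumptions: the search is a crude one-dimensional downhill walk with step halving, not the Levenberg-Marquardt algorithm used by Gnuplot; the data are made up and noise-free (generated with K = 700 nM), not the course data set; and the names fit_k, step and smallest_step are invented for this illustration.

```python
def activation(ligand, k):
    """Equation 3.2: receptor activation as a function of ligand concentration."""
    return ligand / (ligand + k)


def total_error(k, data):
    """Sum of squared differences between measured and predicted activation."""
    return sum((measured - activation(ligand, k)) ** 2 for ligand, measured in data)


def fit_k(data, k_start, step=100.0, smallest_step=1e-6):
    """Step downhill in K; halve the step whenever neither neighbour is lower."""
    k = k_start
    error = total_error(k, data)
    while step > smallest_step:
        neighbours = [c for c in (k - step, k + step) if c > 0]
        best = min(neighbours, key=lambda c: total_error(c, data))
        best_error = total_error(best, data)
        if best_error < error:
            k, error = best, best_error
        else:
            step /= 2
    return k, error


# Made-up, noise-free data generated with K = 700 nM (not the course data set)
ligand_values = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]
data = [(ligand, activation(ligand, 700.0)) for ligand in ligand_values]

k_fit, error = fit_k(data, k_start=100.0)
print(k_fit, error)
```

With real, noisy data the remaining error will not drop to zero, and the fitted K will depend on the scatter of the individual points.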

[Figure: receptor activation (%) versus ligand concentration (nM, logarithmic axis from 10^1 to 10^5); curves: kstart, kfit]

This figure uses the same data as example ?? and shows the curves obtained with an arbitrarily chosen starting value for K, as well as with the fitted K. The plot file that first produces this figure is listed below.

3.4.5 The 5-HT2B receptor can be up- and down-regulated by ligands

[Figure: IP3 release (%) versus ligand concentration (µM, logarithmic axis from 10^-3 to 10^4); data sets: aripiprazole, serotonin]

The 5-HT2B receptor activates phospholipase C, which releases inositoltriphosphate (IP3). The drug aripiprazole increases receptor activity, but the physiological ligand serotonin decreases it; therefore, receptor activity must be greater than zero without the drug.

Clearly, we need to modify our activity function to account for this behaviour.

3.4.6 Receptor activation or inhibition by ligand – theory

Assumptions:

1. Ligand still binds according to law of mass action: K = [L][R]/[LR]

2. Receptor has a basal level of activity A_free in the unbound state, and some other level of activity A_bound with bound ligand.

→ Degree of receptor activation, as a function of [L]:

\[ A = A_{free} + (A_{bound} - A_{free}) \frac{[L]}{[L] + K} \tag{3.3} \]

In equation 3.3, A_free could be treated as variable, or as fixed. The activity data shown in section 3.4.5 have been normalized to 100% in the absence of ligand, so it seems natural to fix A_free at that value. On the other hand, since not all data points were used in that normalization, we might obtain a nicer looking fit by letting A_free float. Which choice is better?
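Written as a function, equation 3.3 looks as follows. This is a minimal Python sketch; the parameter values in the example calls (K = 1, A_free = 100, A_bound = 160 or 40) are made up and are not fit results.

```python
def activation(ligand, k, a_free, a_bound):
    """Equation 3.3: activity of a receptor with basal activity a_free."""
    saturation = ligand / (ligand + k)
    return a_free + (a_bound - a_free) * saturation


# Made-up parameter values, activities in percent
print(activation(0.0, 1.0, 100.0, 160.0))   # no ligand: basal activity
print(activation(1.0, 1.0, 100.0, 160.0))   # [L] = K: halfway between the two levels
print(activation(1.0, 1.0, 100.0, 40.0))    # a_bound < a_free: ligand inhibits
```

Fixing A_free then simply means passing a constant for a_free and letting the fit adjust only k and a_bound. Equation 3.2 is the special case a_free = 0 and a_bound = 1.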

3.4.7 How many variable parameters should we use?

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

John von Neumann

The more variable parameters we allow, the more likely it becomes that the theoretical model will be able to adopt a shape consistent with the experimental data, even without any inherent physical validity. So, generally speaking, the fewer variable parameters, the better; parameters should only be treated as variable if there is a sound reason for it.

So, in our example, the right choice is to fix A_free.

3.4.8 How good is the fit?

If we run the Gnuplot file from the above example like so:

gnuplot iprelease.plt

we see the summary of the second fit close to the bottom of the screen, but the first one is buried in clutter. Let’s filter the output for the information we want:

gnuplot iprelease.plt 2>&1 | grep variance

This gives us:

variance of residuals (reduced chisquare) = WSSR/ndf : 43.7331
variance of residuals (reduced chisquare) = WSSR/ndf : 37.1166

First off, how did the output filtering work? With the | character, we sent Gnuplot’s output through a pipe to grep, which filtered it for lines containing the word “variance”. The idea of a pipe is that one program’s output becomes the second program’s input.

To make this work, we had to first redirect Gnuplot’s output from its so-called stderr, or standard error stream, to its stdout, or standard output stream, since only the latter can be attached to a pipe. The numbers 2 and 1 are so-called file handles (file descriptors) with which we can refer to stderr and stdout, respectively.
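The same plumbing can be reproduced from Python, which may help to see what the shell does for us. In this sketch, a small child process stands in for Gnuplot and writes two lines to its stderr; the first line is invented clutter, the second is the first variance line quoted above. The child_code stand-in is an assumption made so that the example runs without Gnuplot or the course files.

```python
import subprocess
import sys

# Stand-in for Gnuplot: a child process that writes its report to stderr.
# (The real call would be something like ["gnuplot", "iprelease.plt"].)
child_code = (
    "import sys\n"
    "sys.stderr.write('Iteration 5\\n')\n"
    "sys.stderr.write('variance of residuals (reduced chisquare) = WSSR/ndf : 43.7331\\n')\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,   # same effect as 2>&1
    text=True,
)

# Same effect as "| grep variance"
matches = [line for line in result.stdout.splitlines() if "variance" in line]
print(matches)
```

Without stderr=subprocess.STDOUT, result.stdout would be empty here, just as grep would receive nothing without 2>&1.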

The two variance (total error) values that we obtain are for the two fits for the first and the second data set, respectively. This is what Gnuplot thinks is the reduced χ² value (see below). For a real calculation of χ², we would need the error of measurement. Since we don’t have the experimental error here, these χ² values don’t mean much; they are only useful if we have others to compare them to. Here, we could compare them to an alternative fit, with A_free set as variable.

Rewriting the Gnuplot file to perform this fit is left as an exercise to the gentle reader. If you did it right, you should see this output:

variance of residuals (reduced chisquare) = WSSR/ndf : 48.749
variance of residuals (reduced chisquare) = WSSR/ndf : 39.4221

These variances are higher than before – the fit got worse, not better. So, we were right to use a fixed A_free in the first place.

3.4.9 Testing exact theories with inexact data

“From all this it is plain that these observations agree with theory, so far as they agree with one another.”

Isaac Newton, in discussing his calculations on the comet of 1680

Succinct expressions of difficult concepts, like this specimen, are the hallmark of true genius. In my own, not quite so succinct words: If we want to use curve fitting to test the validity of some theoretical model, we need to know the limits of experimental accuracy. The experimental error is usually estimated from the variation of repeated measurements. We then use the following rationale:

\[ \text{total error} = \text{measured data} - \text{predicted value} \tag{3.4} \]

\[ \text{error of theory} = \text{total error} - \text{measurement error} \tag{3.5} \]

If all observed errors are accounted for by errors of measurement, error of theory is zero, and the theory is true. Otherwise, the theory is false.

The above pseudo-equations outline the principle but are not literally true. Instead of the differences between measurement and prediction, we consider the squares of those differences. This is based on the assumption that errors of measurement are statistically distributed around a mean value: Two small deviations are more likely and therefore are of less concern, if observed, than one large deviation.

3.4.10 Testing a theory with adjustable parameters

A theory with adjustable parameters is considered valid if there exists a combination of values for these parameters that will yield an overall error no greater than the expected error of measurement.

Therefore, we need to


1. Find the best possible combination of values for the adjustable parameters, in order to minimize the overall error. This is done through numerical fitting.

2. Compare the remaining error to the known error of measurement.

A parameter that helps us to compare the overall error to the experimental error is the reduced χ².

3.4.11 Evaluating the fit error: χ²

Definition:

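\[ \chi^2 = \left(\frac{t_1 - m_1}{\sigma_1}\right)^2 + \left(\frac{t_2 - m_2}{\sigma_2}\right)^2 + \ldots + \left(\frac{t_n - m_n}{\sigma_n}\right)^2 \]

t_1, t_2 ... t_n: theoretical values
m_1, m_2 ... m_n: measured values
σ_1, σ_2 ... σ_n: standard deviations for measured values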

The reduced χ² is used to make the error estimate independent of the number of data points:

\[ \chi^2_{red} = \frac{\chi^2}{n - p} \]

where n is the number of data points and p is the number of adjustable parameters; the difference n − p is the number of degrees of freedom (the ndf in Gnuplot’s output). In the above example (section 3.4.6), we noticed that χ²_red increased with the introduction of another free parameter. We can see now how this works – a greater number of variable parameters decreases the denominator of χ²_red.

In the ideal case, where all remaining error is error of measurement only, χ²_red should reach a value of 1. In reality, it will usually remain somewhat higher; deciding whether a given fit is “good enough” often is somewhat arbitrary.
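The two definitions translate directly into code. The following Python sketch uses made-up numbers for the theoretical values, the measured values and their standard deviations; only the formulas are taken from the text above.

```python
def chi_square(theory, measured, sigma):
    """Sum of squared deviations, each scaled by its standard deviation."""
    return sum(((t - m) / s) ** 2 for t, m, s in zip(theory, measured, sigma))


def reduced_chi_square(theory, measured, sigma, n_params):
    """Chi-square divided by (number of data points - number of parameters)."""
    n = len(measured)
    return chi_square(theory, measured, sigma) / (n - n_params)


# Made-up numbers for illustration
theory = [10.0, 20.0, 30.0, 40.0, 50.0]
measured = [11.0, 19.0, 32.0, 38.0, 50.0]
sigma = [1.0, 1.0, 2.0, 2.0, 1.0]

print(chi_square(theory, measured, sigma))
print(reduced_chi_square(theory, measured, sigma, n_params=1))
print(reduced_chi_square(theory, measured, sigma, n_params=2))
```

The last two calls use the same χ² but a different number of parameters: if an extra parameter does not lower χ² itself, the smaller denominator makes the reduced χ² go up.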

3.4.12 How do we obtain the standard deviations of the measured values?

1. Repeat the measurements a sufficient number of times. Triplicate measurements are often used but not statistically reliable; 10 repetitions is more like it

2. If the signal consists of a number of discrete counts, such as photons in a photon-counting fluorescence detector or in a β- or γ-counter, we can estimate the standard deviation of the signal N according to: σ = √N

The first approach is universal. The second one is convenient, but it only gives the theoretical minimum value of the experimental variance, which results from counting statistics alone. Possible sources of error such as for example baseline noise from the detector or intensity fluctuations of the light source will not be accounted for.
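As a quick worked example of the counting rule, with made-up count numbers: the relative error shrinks with the square root of the number of counts.

```latex
\[
\frac{\sigma}{N} = \frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}}
\]
% N = 100 counts:     sigma = 10,   relative error 10 %
% N = 10 000 counts:  sigma = 100,  relative error  1 %
```

So collecting 100 times more counts improves the relative precision only tenfold.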


3.4.13 A practical exercise: Calcium binding to daptomycin

[Figure: chemical structure of daptomycin]

Daptomycin is a lipopeptide antibiotic. It contains some non-standard amino acids, including a kynurenine residue (the lowermost aromatic side chain in the figure) that is intrinsically fluorescent. Fluorescence is bright if daptomycin is bound to membranes, but it is dim when it is in solution.

3.4.14 What daptomycin is supposed to do

[Figure: daptomycin and calcium (Ca) in solution, with PC membranes, and with PC/PG membranes; further labels: PG, K+]


Daptomycin binds to membranes and forms oligomers. Binding and oligomerization are dependent on calcium ions and on negatively charged lipids such as phosphatidylglycerol (PG) in the membrane.

3.4.15 One or more types of binding sites for calcium?

• Incubate daptomycin with membranes and calcium
• At t=0, add EDTA to capture calcium
• Follow kinetics of daptomycin fluorescence decrease

It is not known for certain how the calcium ions interact with daptomycin and with the lipids on the membrane. In the simplest case, all binding sites for calcium could be equivalent, with the same rate of binding and dissociation. In this case, withdrawal of calcium with a chelator (EDTA) at t=0 should produce a single-exponential decrease in the daptomycin fluorescence. On the other hand, if there are different classes of binding sites with different rates of calcium dissociation, the time course of the fluorescence should have two or more exponential terms.

3.4.16 Daptomycin fluorescence after addition of EDTA at t = 0

[Figure: plot of fluorescence (cps/10^6) versus time (minutes)]

3.4.17 A single-exponential model

F = F_{\text{basal}} + F_{\text{inc}} \, e^{-t/\tau} \qquad (3.6)

In the experimental time course, the intensity drops off fast initially and then seems to level off. It is reasonable to assume that there should be some residual fluorescence at t = ∞, which would correspond to the fluorescence of daptomycin in solution. On top of this basal fluorescence, we assume an additional component that at t = 0 equals F_inc and undergoes a single-exponential decay.

This model is easily extended by adding more exponential terms, each with its own pre-exponential (F_inc) and time constant (τ).
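The extension to several terms can be written as one function. The following Python sketch is an illustration only: the function name `multi_exp` is made up, and the numbers are borrowed from the starting guesses used in listing 3.6, not from fitted results.

```python
import math


def multi_exp(t, f_basal, terms):
    """Basal fluorescence plus a sum of exponential decays.

    terms is a list of (f_inc, tau) pairs, one per exponential term.
    """
    return f_basal + sum(f_inc * math.exp(-t / tau) for f_inc, tau in terms)


f_basal = 3.0e5
one_term = [(9.0e5, 300.0)]                  # single-exponential model
two_terms = [(4.5e5, 50.0), (4.5e5, 500.0)]  # two classes of sites

for t in (0.0, 300.0, 3600.0):
    print(t, multi_exp(t, f_basal, one_term), multi_exp(t, f_basal, two_terms))
```

Both variants start at the same total intensity at t = 0 and approach the basal value at long times; they differ only in how they get there, which is what the fits are meant to distinguish.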

3.4.18 Fitting with 1 to 4 exponential terms

The Gnuplot script edtafit.plt (listing 3.6) contains the code for running all fits, one after another. Invoke like so:


gnuplot edtafit.plt 2>&1 | tee fitresults

The | tee trick duplicates the (redirected) output of the gnuplot command – we get to see it on the screen, and a copy is saved to the file fitresults. This is convenient for later analysis.

Alternatively, you can use the file fit.log, in which Gnuplot accumulates the output of all fits.

Sit back and enjoy the numbers scrolling by. Each screenful of numbers shows the results of one iteration. You may notice that the iterations become slower, and substantially more numerous, as we go from simple to complex models.

The Gnuplot script contains a lot of comments and explanations – it merits a good going over. In particular, notice how we obtain the error estimates from the fluorescence intensities, according to σ = √N.

3.4.19 Where are the parameters obtained from the fit?

grep -A 11 'Final set' fitresults

With the -A 11 option, grep not only prints each line matching Final set but also the next 11 lines. Try man grep to learn more about grep’s power. $0.50 (Canadian Tire) for anyone who comes up with a problem that grep can’t solve.

3.4.20 Which fit is the best?

grep variance fitresults

should give you

variance of residuals (reduced chisquare) = WSSR/ndf : 1448.6
variance of residuals (reduced chisquare) = WSSR/ndf : 26.8559
variance of residuals (reduced chisquare) = WSSR/ndf : 2.89905
variance of residuals (reduced chisquare) = WSSR/ndf : 1.444

These are the χ²_red values for the 1-, 2-, 3- and 4-exponential fit, respectively. What do we make of them?

Remember that χ²_red has a theoretical minimum of 1 for a perfect fit. Looking at those numbers above, it is clear that only the 3- and the 4-exponential models come close enough for consideration. In contrast, the single-exponential model is way beyond the moon, and the 2-exponential model is still in orbit.
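If you would rather collect these numbers programmatically than with grep, a short Python sketch could look like the one below. It assumes that the relevant output lines are worded exactly as shown above; the function name and the embedded sample text are for illustration only.

```python
import re

# matches the summary line that Gnuplot prints after each fit
PATTERN = re.compile(
    r"variance of residuals \(reduced chisquare\) = WSSR/ndf\s*:\s*(\S+)"
)


def reduced_chi_squares(text):
    """Return all reduced chi-square values found in the fit output, in order."""
    return [float(match.group(1)) for match in PATTERN.finditer(text)]


sample = """\
variance of residuals (reduced chisquare) = WSSR/ndf : 1448.6
variance of residuals (reduced chisquare) = WSSR/ndf : 26.8559
variance of residuals (reduced chisquare) = WSSR/ndf : 2.89905
variance of residuals (reduced chisquare) = WSSR/ndf : 1.444
"""

values = reduced_chi_squares(sample)
for n_exponentials, value in enumerate(values, start=1):
    print(n_exponentials, "exponential(s):", value)
```

To run it on the real output, read the file fitresults into a string and pass that to the function instead of the sample text.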

3.4.21 Plotting the fit residuals

The results of the fit can also be visualized using the fit residuals:

\text{residuals} = \frac{t - m}{\sigma} \qquad (3.7)


which ideally should be just statistical flicker around zero.

Run the file edtadiffs.plt to see the plots of residuals from all four fits.
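For a handful of points, the weighted residuals can also be computed by hand. The Python sketch below is an assumption-laden illustration: it takes the difference as measured minus model (the opposite convention only flips the sign), uses σ = √N from the measured counts as in the fit script, and the numbers are invented.

```python
import math


def weighted_residuals(measured, model):
    """Difference between data and model, in units of the estimated sigma."""
    return [(y - f) / math.sqrt(y) for y, f in zip(measured, model)]


# invented counts and model values at three time points
measured = [10000.0, 6400.0, 2500.0]
model = [9900.0, 6480.0, 2500.0]

print(weighted_residuals(measured, model))
```

Values scattered within a few units around zero are what pure counting noise would produce; long runs of residuals with the same sign point to a systematic misfit.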

3.4.22 Residuals from a good fit (4 exponentials)

[Figure: plot of residuals versus time (seconds) for the 4-exponential fit]

Here, the residuals are fairly evenly distributed around zero; only in the first ~500 seconds is there some apparent systematic distortion that represents data not fitted adequately.

3.4.23 Residuals from a poor fit (2 exponentials)

[Figure: plot of residuals versus time (seconds) for the 2-exponential fit]

Here, the random noise is much smaller than the large movements of the entire curve, which represent a substantial residue that is not adequately covered by the model. Therefore, a 2-exponential model is too simple.

3.4.24 So have we found the truth?

Remember John von Neumann: We may have found the truth, but we may also have fitted a wiggling elephant.

All we can really say is that the kinetic data do not support a model in which a single class of calcium binding sites, or even two classes of sites, kinetically control the release of daptomycin from the membrane – the kinetics of calcium and daptomycin dissociation is more complex than that.

3.5 Code and data listings

Listing 3.1: Gnuplot script to fit activation of dopamine receptors by aripiprazole (gppractice/arifit.plt)

# this minimal file only performs the data fit, no plotting.

set datafile separator ","   # read data from comma-separated file

# define the receptor activation function, scaled to 100%
activation(x, k) = 100 * x / (k + x)

# set an initial value for k
k_fit = 5000

# next comes the call to the fit routine. The 'via' clause
# indicates which parameters are to be treated as variable
# here, we have only one, but we still need to declare it.

fit activation(x,k_fit) 'ari3-da.csv' using 1:2 via k_fit

Listing 3.2: Datafile for Gnuplot script in listing 3.1 (gppractice/ari3-da.csv)

#"GTPgS-binding(dopamine), created by Plot Digitizer, 2.4.1"
#"Date: 11/22/07, 7:28:29 PM"

#dopamine,GTP-gamma-S-binding
9.82516E-1,-4.30702E-1
1.01527E+1,1.74836E+0
9.68155E+1,1.31456E+1
3.09817E+2,3.12565E+1
9.15102E+2,5.78764E+1
3.16345E+3,7.84713E+1
1.01471E+4,8.77169E+1
9.67349E+4,1.00178E+2

Listing 3.3: Gnuplot script to fit and plot dopamine receptor activation by aripiprazole. The data are again from listing 3.2 (gppractice/arifitplot.plt)

# settings for the plot
load "setup_eps.plt"          # set up the eps terminal
set output "arifitplot.eps"   # write plot to this file
set logscale x                # logarithmic x-axis
set xtics 10, 10, 1e5         # set x-axis tics from 10 to 10^5
                              # in intervals of 10
set mxtics 1                  # hide minor axis tics by
                              # setting them to 1 per major tick
set format x "10^{%T}"        # format numbers on x-axis as powers of 10
set xrange [8:1.05e5]         # define the range of the x axis -
                              # a little space on the sides
set xlabel "Ligand concentration (nM)"

set ylabel "Receptor activation ({/Symbol %})" offset 1.5,-0.25
set yrange [-10:101]
set ytics 0, 20, 100          # note that we don't set a log y scale

set key top left              # location of the plot legend

# done with the formatting stuff, now on to the actual work
set datafile separator ","    # read data from comma-separated file

# receptor activation function, scaled to 100%
activation(x, k) = 100 * x / (k + x)

k_start = 5000                # set an initial value for the variable
                              # parameter and remember it

k_fit = k_start               # k_fit will be different after fitting

# call the fitting routine
fit activation(x,k_fit) 'ari3-da.csv' using 1:2 via k_fit

# k_fit now contains the optimized value. Plot the data and the
# function, with both the initial and the fitted values for k.
plot "ari3-da.csv" using 1:2 title "" with points pt 6 , \
     activation(x, k_start) title "k_{start}" with lines lt 2, \
     activation(x, k_fit) title "k_{fit}" with lines lt 1

Listing 3.4: Gnuplot script to fit up- and down-regulation of serotonin receptors (gppractice/iprelease.plt)

# IP3 release after 5-HT2B receptor activation
# fit and plot dose-effect curves for aripiprazole and serotonin

load "setup_eps.plt"
set output "iprelease.eps"
set datafile separator ","
set format x "10^{%T}"
set xlabel "Ligand concentration ({/Symbol m}M)"
set logscale x
set key top left width -6 font "Helvetica,18"
set ylabel "IP_{3} release (%)" offset 1.5,-0.25
set yrange [40:160]
set xrange [0.5e-3:1.1e4]
set ytics 40, 40, 160
set mytics 1

# the interesting part

# receptor activity function. We fix the starting activity at 100%.
# K, as well as the final activity, will vary freely.
activity(x, a_final, k) = 100 + (a_final - 100) * x / (k + x)

# define separate variables to be fitted for the two data sets
# aripiprazole
a_final_ari = 50
k_ari = 100

# serotonin (5-ht)
a_final_ht = 150
k_ht = 100

# perform the fits
fit activity(x, a_final_ari, k_ari) "pi-hydrolysis.csv" \
    using 1:2 index 0:0 via a_final_ari, k_ari

fit activity(x, a_final_ht, k_ht) "pi-hydrolysis.csv" \
    using 1:2 index 1:1 via a_final_ht, k_ht

# here, we plot only the fitted functions, not the starting ones
plot "pi-hydrolysis.csv" using 1:2 index 0:0 with points pt 6 title "aripiprazole", \
     activity(x, a_final_ari, k_ari) with lines lt 3 title "", \
     "" using 1:2 index 1:1 with points pt 7 title "serotonin", \
     activity(x, a_final_ht, k_ht) with lines lt 1 title ""

Listing 3.5: The data file for listing 3.4 (gppractice/pi-hydrolysis.csv)

# phosphatidylinositol hydrolysis in response to
# serotonin receptor type 2B activation
# two data blocks, separated by two or more empty lines

# with aripiprazole. This data block is selected
# with 'index 0:0' in the corresponding plot file
9.80193E-3,1.01172E+2
1.02716E+0,8.41158E+1
1.07407E+1,7.79321E+1
3.19429E+1,5.86920E+1
1.08057E+2,4.72681E+1
1.08000E+3,4.99374E+1


# with 5-HT (serotonin). Selected with 'index 1:1'
1.02478E-3,9.58997E+1
1.02129E-1,1.13217E+2
1.06958E+0,9.92212E+1
9.77547E+0,1.14909E+2
3.43921E+1,1.24840E+2
1.06137E+2,1.38414E+2
1.01405E+3,1.49415E+2
1.01268E+4,1.56251E+2

Listing 3.6: Gnuplot script that fits single- to quadruple-exponential decays to the daptomycin EDTA dissociation kinetics (gppractice/edtafit2.plt)

# fit the edta kinetics experiment, no plots
# somewhat simplified from a version that was included earlier

set datafile separator ","

# we test out between one and four exponential decays. We will declare
# only one function for all these different cases (named exp4):

exp4(t) = fbas + \
          finc1 * exp(-t/tau1) + \
          finc2 * exp(-t/tau2) + \
          finc3 * exp(-t/tau3) + \
          finc4 * exp(-t/tau4)

# note that this function only receives one parameter - the time.
# The other parameters must exist in the "global" space - we will
# define them below.
#
# we glean initial values for the parameters from the data. At time 0,
# the intensity is ~1.2 million, at the end it's around 0.3 million.
# So, we use 0.3 million as the basal fluorescence, and the remainder
# as the incremental fluorescence that participates in the exponential
# decay(s) - the pre-exponential.

ftotal = 1.2e6
fbas = 3e5
finctotal = ftotal - fbas

# Initially, we use only one exponential term. We assign it all the
# incremental fluorescence as the pre-exponential.
finc1 = finctotal

# Fooling exp4: we set all unused pre-exponentials to zero,
# so that they will not affect the result of the calculation.
finc2 = finc3 = finc4 = 0

# we use a guess for the time constant
tau1 = 300

# we set all other time constants to 1, so that we don't get a zero
# division error - the tau values are in the denominator of the exponent
tau2 = tau3 = tau4 = 1

# fit the single-exponential model. The only parameters that we will
# allow to vary are fbas, finc1 and tau1. This is determined by the 'via'
# clause.
# also note the 'using' clause: The third element is the error associated
# with each data point. Here, we use the square roots of the intensities
# as estimates for the error, which applies to measurements of stochastic
# signals such as fluorescence, radioactivity and similar.

fit exp4(x) "edta_kinetics.csv" \
    using 1:2:(sqrt($2)) \
    via fbas, finc1, tau1

# for the two-exponential fit, we assign finc2 and tau2 some initial
# values and include them in the via clause.
# we will use an ad-hoc construction method for the initial parameter
# values that we can later extend to the 3- and 4-exponential fits.

interval = 10
tau1 = 50
tau2 = interval * tau1

finc1 = finc2 = finctotal/2

fit exp4(x) "edta_kinetics.csv" \
    using 1:2:(sqrt($2)) \
    via fbas, finc1, tau1, finc2, tau2

# lather, rinse, repeat
interval = 5
tau1 = 20
tau2 = interval * tau1
tau3 = interval * tau2

finc1 = finc2 = finc3 = finctotal/3

fit exp4(x) "edta_kinetics.csv" \
    using 1:2:(sqrt($2)) \
    via fbas, finc1, tau1, finc2, tau2, finc3, tau3

# apply reconditioner. Here, I had to tweak the interval and tau1, because
# the fit would abort with 'undefined value' errors. That probably resulted
# from attempted zero division. Changing the initial parameters will
# change all subsequent numbers computed during the fit, and so with
# some trial and error one can sidestep this problem.
# You can provoke an error with interval=5 and tau1=15.

interval = 5
tau1 = 10
tau2 = interval * tau1
tau3 = interval * tau2
tau4 = interval * tau3

finc1 = finc2 = finc3 = finc4 = finctotal/4

fit exp4(x) "edta_kinetics.csv" \
    using 1:2:(sqrt($2)) \
    via fbas, finc1, tau1, finc2, tau2, finc3, tau3, finc4, tau4


Chapter 4

Protein structure visualization with Jmol and Pymol

4.1 Introduction

Protein structures

• are usually determined by X-ray diffraction analysis of protein crystals
• can sometimes be determined by NMR, particularly with smaller proteins

Protein crystallization

• requires relatively large amounts of pure protein - has really taken off only once recombinant methods of protein expression became available
• is more difficult to achieve with membrane proteins; number of membrane protein structures lags behind that of soluble protein structures, but the situation is changing

4.1.1 Why X-rays?

• Diffraction of X-rays by crystals discovered by Max von Laue (Nobel prize 1914); proved both the wave nature of X-rays and the periodic lattice structure of crystals. Theory worked out by Bragg sen. and jun. (Nobel prize 1915)

• In general terms: Periodic assemblies (crystals) will diffract electromagnetic waves by way of constructive interference if and only if the wavelength is similar to the spacing of the diffracting centers

• The wavelength of X-rays is similar to the length of chemical bonds – γ-rays are too short, UV rays are too long
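The standard quantitative form of this condition is Bragg's law, nλ = 2d sin θ, which is not spelled out in these notes. The Python sketch below uses it to show what happens when the wavelength does not match the spacing; the wavelengths and the spacing of 1.54 Å are illustrative values chosen for round numbers, and the function name is made up.

```python
import math


def bragg_angle_deg(wavelength, spacing, order=1):
    """Angle theta (degrees) that satisfies n * lambda = 2 * d * sin(theta).

    Returns None if no angle can satisfy the condition.
    Wavelength and spacing must be in the same unit (here: angstroms).
    """
    sin_theta = order * wavelength / (2.0 * spacing)
    if sin_theta > 1.0:
        return None  # wavelength too long for this spacing
    return math.degrees(math.asin(sin_theta))


spacing = 1.54  # angstroms, roughly the length of a C-C bond

print(bragg_angle_deg(1.54, spacing))    # X-ray: a comfortable angle near 30 degrees
print(bragg_angle_deg(2000.0, spacing))  # UV (200 nm): None, no reflection possible
print(bragg_angle_deg(0.01, spacing))    # gamma ray: a very small angle
```

For wavelengths much longer than the spacing, the condition cannot be met at all; for much shorter ones, the angles become so small that they are hard to resolve.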

4.1.2 Is it easy?

Max Perutz

• was the first to tackle the structure of a protein crystal (hemoglobin)
• was declared a lunatic when he announced his intention to do so
• worked more than 25 years to finish it up

Even this first crystal structure was only solved after computers had become available. The calculations involved are too much for humans.

4.1.3 Protein structure databases

1. The Protein Data Bank: rcsb.org/pdb/home/home.do

2. NCBI Pubmed ncbi.nlm.nih.gov/sites/entrez?db=structure

In the last couple of years, the number of protein crystal structures that have come out has really exploded. There are now many structures available of proteins that have not even been biochemically characterized.

4.1.4 Protein structure family relations

• Sequence homology families are a familiar concept
• 3D-structural homology usually accompanies sequence homology but may extend even further; that is, it may occur even between proteins that have no significant sequence similarity

4.1.5 The PDB data format

• Standard format for macromolecular structures
• Text-based – human-readable, sort of, but usually disfigured by lots of computer-generated tripe
• Contains annotation on protein structure (α-helices, β-sheets, disulfide bonds) that may be displayed by molecular viewers
• Contains quite a bit of secondary information on experimental conditions, citations and the like that do not show up in molecular viewers, so it is often worthwhile to look over a pdb file with one’s own eyes

4.1.6 Software for molecular visualization

Examples

1. Rasmol – excellent in its day and performs wonderfully on low-end hardware, but now dated. The scripting language lives on in Jmol

2. Jmol – Java program that can run as an applet (inside web pages) and stand-alone. Scriptable with an extended version of the Rasmol scripting language

3. Pymol – Programmed with a mixture of Python and C++, very flexible, produces very good images but clunkier than Jmol in some ways

4. Cn3d – the “official” viewer of the NCBI. I haven’t used it much, so can’t comment on its qualities


These programs are all freely available. Thorsten Dieckmann swears by another one, UCSF Chimera. From looking at its web page, it seems similar in power to Pymol, but I have no hands-on experience with it.

Jmol offers a nice balance of ease of use and capability, so that’s what we are going to use for our initial exercises.

4.2 Jmol

4.2.1 Jmol exercises

Open a shell window and cd to your jmol-practice directory, then enter the command

jmol chloroA.pdb

Jmol should start and show you something like this:

Click and drag with the mouse to rotate. Hold down the shift key, and double-click, then drag to shift the molecule. Roll your mouse wheel to zoom in and out. For now, quit Jmol.

4.2.2 The PDB file

Before having some more fun with Jmol, let’s look at the data file, chloroA.pdb. Type

less chloroA.pdb

to look at it. As you can see, it is quite human-readable, and it starts with a lot of information, including the protein sequence and the regular secondary structure motifs (α-helices and β-sheets). The coordinates start with the first line prefixed with ATOM:

ATOM      5  N   THR A  10      92.241 155.870 190.344  1.00 33.86           N


4.2.3 The fields of the ATOM record

ATOM     record describes a “regular” atom, not a hetero-atom
5        running number of the atom (arbitrary)
N        atom name (relates to the residue)
THR      residue (threonine)
A        chain
10       residue number (in this file, residues 1-9 are missing)
92.241   x-coordinate (then y, then z)
1.00     occupancy
33.86    temperature factor (mobility of the atom)
N        element

. . . we can use most of these fields to select the atom within Jmol.
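The fields sit at fixed column positions rather than being separated by a delimiter, which is why neighbouring numbers can run into each other. As a rough sketch, the record above can be taken apart in Python by slicing; the column ranges follow the published PDB format description, while the function name and dictionary keys are made up for this example.

```python
def parse_atom_record(line):
    """Split one ATOM/HETATM line into its fixed-width fields."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


line = "ATOM      5  N   THR A  10      92.241 155.870 190.344  1.00 33.86           N"

atom = parse_atom_record(line)
for key, value in atom.items():
    print(key, "=", value)
```

Splitting on whitespace instead would work for this particular line, but it fails as soon as two fields touch, for example a four-digit residue number next to a chain name.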

4.2.4 A hetero-atom record

HETATM 5567 MG   BCL 1   3      58.663 173.879 180.379  1.00  7.66          Mg

Now, let’s start up jmol again with the same command as before: jmol chloroA.pdb.

Hetero-atoms usually follow below the “regular” atoms, that is, those that are part of the macromolecule itself. This HETATM record represents the first magnesium of the first chlorophyll molecule. Note that it has been assigned the chain name 1, as have all molecules of chlorophyll associated with the protein chain A. Such decisions are up to the PDB file’s author; some PDB files are well-organized like this one, while others aren’t.

BCL, I suppose, stands for “bacteriochlorophyll”. Again, such acronyms for prosthetic groups, drugs, or other ligands are made up on the spot, and the fastest way to find out about them is just to look at the pdb file.

4.2.5 Tweaking the view

Bring up a Jmol-console: Right-click in the main window to bring up the context menu and then choose “console”. First, let’s change the background color:

background white

Let’s blow up the atoms to their (approximate) van der Waals size:

spacefill

We can also scale them up or down with for example

spacefill 50%

Or, we can assign them explicit radii (in angstroms):


spacefill 2.0

Jmol has a menu. This is for wimps; real men use the console. (The menu also is somewhat limited, so you actually have no choice.)

Next, we want to distinguish the protein from the hetero atoms:

select protein; color white; select bcl; color green; select water; color blue

You should now see something like this (after some manual rotation with the mouse):

We don’t really need that second protein molecule. Let’s axe it, and while we are at it, get rid of the water molecules, too:

restrict (chain=a or chain=1) and not water

Now, if you rotate the molecule, it feels a little awkward, since the visible part of the molecule is off center. Click View→Define center in the menu to fix this.

4.2.6 Saving our hard work

We can

1. Save the current state in a Jmol script
2. Save an image (screenshot)
3. Export a povray script

An image is just a snapshot of the current display; it cannot be changed later from within Jmol. Similarly, a povray script is for creating a still image; povray simply adds some 3D spiffiness to the image.

Saving a Jmol script is different – you can load up the script later and continue to modify the display of the molecule.

4.2.7 Saving the current state

You can do so from the menu (File→Export→Write state) or from the console:

write state state1.spt

Did it work?


zap; load state1.spt

should delete the current view and then restore it from the saved file. You can use the load command anytime to revert to a previously saved state in case you goofed up.

4.2.8 Saving images

Two methods

• From the menu: Export→Export image
• From the console: write image 2000 2000 chloroA.png

The console method has the advantage that you can increase the resolution, which is advisable for printed documents.

In either case, I recommend using the PNG format for exporting, since it is widely compatible and gives better quality than the JPEG format. GIF is similar to PNG but more compact; it works in many applications but not with pdfLaTeX, whereas PNG and JPEG do.

4.2.9 Looking at protein folds

Let’s explore the molecule some more. How is it folded? Let’s inspect the backbone of the polypeptide chain.

restrict chain=a and backbone

A better display for this is:

spacefill off; wireframe 0.3

With

antialiasDisplay=true

the image will look nicer, but at the expense of a slower response to mouse movements. When saving images, the antialias switch seems to be set implicitly, so as long as you only care about the exported images it’s not needed.

4.2.10 Folds. . .

Let’s highlight the secondary structure elements:

select helix; color blue; select sheet; color red

Another way to display the secondary structure is with the cartoon mode:

wireframe off; cartoon on

Our second protein molecule reappeared, since we did not exclude it explicitly in our above select commands. Get rid of it with

restrict chain=a

Save the current display state in fold.spt.


4.2.11 More on selections

We can select atoms using various kinds of atom expressions. Each of these can be used like so: select alpha, and they can be combined by boolean operators: select buried and backbone, select basic or aromatic and so on.

• Role in structure: alpha, backbone, sidechain, hetero
• Solvent: solvent, water
• Properties of sidechains: surface, buried, acidic, aliphatic, basic, charged, hydrophobic, neutral, polar
• Spatial relationships; example: within(10, [trp]179)
• Chemical element; example: element="N"

4.2.12 Exercise: Try to reproduce this display

• One protein chain in white and cartoon mode, helices highlighted in pink
• the associated chlorophyll molecules in wireframe and in different colors, with the central magnesium atoms in spacefill and in blue

4.2.13 Hints

To select individual chlorophyll molecules, you can do

select group="bcl" and resno=3

and so on. . . as you can glean from the pdb file, the ones associated with chain A are numbered 3–9. After selecting each group, apply

color red

and so on. If you run out of colours, try cornflowerblue, cyan, fuchsia, lime, orchid, peachpuff, pink, purple, salmon, turquoise, violet. . .

Select helices with

select structure="helix"

When done, save your state, and save a picture.


4.2.14 And another one

To generate the surface for chain A, use

isosurface select(chain=a) sasurface; color isosurface white

To cut away the front of the molecule, use

slab on; slab 50

The command sasurface means solvent-accessible surface.

Larger or smaller numbers for the slab command will cut away more or less from the molecule.

4.2.15 And a last one

Use slab off to restore the molecule. Make the surface translucent: color isosurface white translucent

Try to render this picture with povray. To use povray, you need to have it installed ;) Jmol is supposed to be able to run povray itself, but I couldn’t get it to work. I ended up saving the povray script to a file (which you can do from the Jmol menu) and then invoking povray manually.

4.3 Pymol

Differences to Jmol:


1. Written in Python and C++, not Java – installation on different platforms can be a bit more cumbersome
2. Doesn’t run inside web pages
3. More advanced graphics capabilities
4. License: If you visit pymol.org, it looks as if it’s commercial – but the code is actually open source and will stay this way

4.3.1 Documentation

A wiki – current, but not quite complete http://www.pymolwiki.org

A manual from the programmer himself – oldish, but still adequate for basic topics: http://pymol.sourceforge.net/newman/userman.pdf

Assorted tutorials collected from the web: http://watcut.uwaterloo.ca/chem731/2011/pymoldocs/

4.3.2 The GUI

[Figure: Pymol screenshot showing the external GUI and the internal GUI]

The GUI is split across two windows. This is somewhat clunky, but it does permit a larger view of the molecule, as you can maximize that window separately.

4.3.3 Opening files

1. From the menu (File→Open)
2. From the command line:
   load file.pdb
3. Directly from the protein database (while online):
   fetch 7ahl (no .pdb extension in this case!)

Commands can be entered either in the internal or external GUI.

You can specify multiple names to load multiple structures in one go (for example for alignments, see later). Note that structures that you fetch from the web also get saved locally into the current directory, so make sure you are in the right one before fetching stuff.

4.3.4 Working with single structures

To create images, the basic workflow is similar to Jmol:

1. Arrange the molecule in space, using the mouse
   • Left mouse button rotates
   • Middle button or wheel moves in XY plane
   • Right button zooms
2. Select parts of the molecule. Note, however, that selections are named in pymol
3. Apply formatting instructions to selections
4. Save image

4.3.5 Exercise: HIV protease with the inhibitor saquinavir bound to it

Load the molecule:

load 2nnp.pdb

You should now see the structure, and an entry representing it in the lower window:

Click S→as→spheres to display everything in spacefill mode

Click A→remove waters

The A S H L C buttons are menus that allow you to work with the entire structure. For each selection you create (see below), you get a new set of buttons that apply the same operations to this selection only.

4.3.6 What are virus proteases, anyway?


Some important viruses such as hepatitis C virus and human immunodeficiency virus (HIV) translate their genomic nucleic acids into polyproteins. The domains of a polyprotein are then cleaved from one another to become the mature virus proteins.

As such, polyproteins are inactive. The only component that is active within the polyprotein is the protease that first cleaves itself and then the other components, which then part ways to serve their respective roles in virus replication and assembly. Protease inhibitors such as saquinavir prevent this necessary maturation and therefore block virus replication. They are effective in the treatment of virus infections.

4.3.7 Saving a cleaned-up version of the molecule

Display heteroatoms:

• Activate the sequence view: click on the S button close to the bottom right corner of the bottom window
• In the sequence view, use the scroll bar to navigate to the right end
• You should now see the following: ROC ACY ACY SO4 GOL GOL
• Select ACY ACY SO4 GOL GOL
• Click S→as→spheres to verify you have the right selection (just a bunch of superficially associated small molecules)
• Click A→Remove atoms
• Save the molecule: File→Save molecule, or type save 2nnp_cleaned.pdb

Molecules like salts, glycerol and detergents are commonly used in crystallography to facilitate crystallization. They often have no real meaning for the biological activity of the protein as such. In our example, ACY is acetate, and GOL is glycerol. SO4 I’m sure you can guess. ROC is the actual ligand (the drug saquinavir), so we want to keep it.

4.3.8 Visualizing structure elements

In the menu of the selection object “all”, click C→by chain→by chain

You should now see that the molecule contains two polypeptide chains, which between them enclose a drug molecule (saquinavir).

4.3.9 Saving state

• From the command line: save 2nnp-state.pse
• From the menu: Save session

Saving state frequently is advisable, since many operations in Pymol can’t be undone.

The save command is the same as used above for the cleaned-up pdb file. Pymol infers your intention from the file extension. The extension pse represents a Pymol session. Unlike the Pymol script files, the session files are not editable, but they do have the advantage of saving the complete program state, not just those parts of the state that were created from the command line.


4.3.10 Selections

Before we can apply further prettification, we first must get hold of the components of the structure. For saquinavir, we can use the sequence view again:

1. Click on ROC (now at the far right)
2. In the “sele” menu, click A→rename selection, then type “saquinavir”.

For the polypeptide chains, it is easier to use the command line:

• select chain_a, chain a
• select chain_b, chain b

Save state: save 2nnp-state.pse

In Pymol, selections are persistent – you can have multiple selections, each of which has a name. While this is not very important with trivial selection criteria such as the ones created in this example, it is really useful with more complicated criteria. For example, we can select the backbone atoms of polypeptide chains like this:

select backbone, name c+o+n+ca

We could then narrow down this selection to specific chains:

select backbone_ab, backbone and (chain a or chain b)

At this level of complexity, persistent selections begin to make sense. Also, even trivial selections such as the ones shown here have the advantage that they give you an ASHLC menu bar.

Selections are retained in your session when you save it as a .pse file.

4.3.11 Prettification

• Apply S→as→surface to chain_a and chain_b
• Apply S→as→sticks to saquinavir

This view nicely illustrates how the drug molecule fills the active site of the HIV protease.


4.3.12 Producing high-quality figures

Invoke the built-in raytracer:

ray

Enhance the resolution with e.g.:

ray 2000

Smooth the edges:

set antialias 3

To apply this antialias value, you have to invoke ray again.

In the command ray 2000, the number specifies the horizontal resolution. A value of 2000 should be o.k. for printed documents.

Possible values for antialias are 0 to 4. For printed output, it should be 2 or 3. A value of 4 burns lots of CPU cycles, but I haven’t seen a great improvement over 3.

Combinations of high values for antialias, image resolution and surface transparency may test the limits of your hardware.

4.3.13 Driving Pymol with scripts

Pymol understands two languages:

1. Python – after all, it is partially programmed in it.
2. Its own scripting language

Let’s try it: Run

@gyrase.pml

. . . and wait, and wait some more. . .

You need Python if you want to extend Pymol’s abilities. Extensions programmed in Python are often packaged as plugins.

The Pymol scripting language is the same that we have also used at the command line within Pymol (load, save, select . . . ) and is documented in the user and reference guides. This language is more straightforward to use and suffices to control the built-in capabilities of Pymol.

What is the advantage of scripting over interactive usage? It depends on the scale of usage. As long as you just need one or two figures, interactive usage is fine. Scripting comes into play if you, for example, need to produce many figures in a consistent style. You could create a script file that selects the backbone of each polypeptide chain and displays it in a consistent way, and then run this script over each structure.
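To make the "many figures in a consistent style" idea concrete, here is a minimal sketch in Python (written for Python 3) that merely writes out the text of a Pymol script for a list of structures. The helper name make_batch_script and the output file names are made up for this illustration; load, select, show, ray and the other Pymol commands are the ones used in this chapter, and png is assumed here as the Pymol command that writes the ray-traced image to a file.

```python
# Sketch: generate one Pymol script that renders several structures
# in the same style. make_batch_script and the *_backbone.png names
# are hypothetical and only serve this illustration.

def make_batch_script(pdb_files, width=2000):
    """Return the text of a .pml script that processes each file in turn."""
    lines = []
    for pdb in pdb_files:
        stem = pdb.rsplit(".", 1)[0]            # "2nnp.pdb" -> "2nnp"
        lines += [
            "reinitialize",                      # start from a clean state
            "load " + pdb,
            "bg_color white",
            "hide everything",
            "select backbone, name c+o+n+ca",    # same selection as in the text
            "show sticks, backbone",
            "ray %d" % width,                    # horizontal resolution
            "png %s_backbone.png" % stem,        # assumed image output command
        ]
    return "\n".join(lines) + "\n"


script = make_batch_script(["2nnp.pdb", "1k4t.pdb"])
print(script)
```

Writing the returned text to a file such as batch.pml and running @batch.pml inside Pymol should then produce one image per structure, all with identical settings.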


4.3.14 The image produced by gyrase.pml

4.3.15 What is DNA topoisomerase anyway?

The gyrase.pml script displays human DNA topoisomerase I, in complex with DNA and the topoisomerase inhibitor topotecan. What do DNA gyrases and topoisomerases do?

This picture illustrates the degree to which DNA is curled up inside the cell: On the left, the packed form of the bacterial chromosome is shown (the light area in the center of the cell), whereas on the right side the DNA is spilled out of the cell.

Transcription and replication of DNA require it to be unpacked or unwound. In both eukaryotic and prokaryotic cells, this is accomplished by DNA topoisomerases I and II.


4.3.16 The reaction catalyzed by DNA topoisomerases

This figure illustrates the basic function of topoisomerases I and II: A DNA molecule is cut, and the free ends are moved past the other DNA molecule and then rejoined. In the case of DNA topoisomerase I, the two DNA molecules are single strands. DNA topoisomerase II applies the same operation to double strands, that is, it cleaves both strands of one double helix and moves the free ends past another double helix. Both activities are needed for transcription and replication.

Inhibitors of DNA topoisomerases are used in both antibacterial chemotherapy and in tumour therapy. Irinotecan (shown in the figure above) is an inhibitor of human topoisomerase I that is used in the treatment of cancer.

4.3.17 Understanding script files

Read through the file gyrase.pml – see comments for explanations.

Listing 4.1: The gyrase.pml script for Pymol (pmpractice/gyrase.pml)

# dna gyrase with dna and topotecan

# clear out and load file
reinitialize
load 1k4t.pdb

# don’t display anything while the settings are adjusted
hide everything

# surface transparency
set transparency, 0.6

# thickness of sticks in stick display
set stick_radius, 0.2

# diameter of spheres
set sphere_scale, 0.6

# illumination for the ray tracer


set direct, 0.2
set fog, 0.5

# define sub-structure selections
select drug, resn TGP
select protein, chain a
select dna, (chain b or chain c or chain d) and not drug

# create a dummy selection to deselect the dna
select dummy, chain z

# define colors for the sub-structures
color gray80, protein

# color one dna strand black, the other gray
color black, dna
color gray70, chain d
color gray50, drug

# set the view coordinates. These were copied from an
# interactive Pymol session. Click "Get view" in the
# outer (top) GUI window to get the current coordinates.
set_view (\
-0.316715509, -0.014092376, -0.948415995,\
-0.048230056, 0.998836935, 0.001265566,\
0.947293043, 0.046144385, -0.317026347,\
-0.000038713, 0.000031844, -265.550964355,\
21.163307190, -1.675732613, 40.628482819,\
191.892059326, 339.233642578, 0.000000000 )

# white background - looks so much better in print
bg_color white

# display everything according to the settings above
show surface, protein
show spheres, drug
show sticks, dna


Chapter 5

Sequence analysis

5.1 Introduction

What is it good for?

• Genome sequencing has provided boatloads of information
• Many sequences encode proteins that have not yet been biochemically characterized
• The function of such uncharacterized proteins can often be inferred by comparison to known sequences and sequence motifs

5.1.1 Sequence analysis resources: Starting points

Web-based:

• Gene and genome databases accessible through NCBI: http://www.ncbi.nlm.nih.gov/

• Directories of analysis tools at EBI: http://www.ebi.ac.uk/ and Expasy: http://ca.expasy.org/

Local:

• The EMBOSS suite of programs
• Look around in your package manager (select the science section, then search for “sequence”)

For one or a few sequences, the on-line resources are sufficient. If we want to analyze and compare large numbers of sequences, it can be useful to download them and run the analysis locally.


For our exercises, we will use the EMBOSS suite. Some additional exercises will be part of the sessions on Python programming.

5.2 Exercises

5.2.1 Proteins of unknown function in the Saccharomyces cerevisiae (baker’s yeast) genome

File yuk.fasta contains the sequences of all proteins that were uncovered by genome sequencing yet had not been characterized biochemically before.

less yuk.fasta

The sequences are listed in the so-called FASTA format – the first line starts with “>” and contains name and description, followed by the protein sequence in single letter code.

How many sequences?

grep -c '>' yuk.fasta

For our exercises, I have compiled a file that contains all sequences of proteins with unknown function from the genome of Saccharomyces cerevisiae. This file dates back to 2009 – some of the sequences may have been biochemically characterized in the meantime.
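The FASTA layout described above is simple enough to read with a few lines of Python. The following is a minimal sketch (written for Python 3), not a full-featured parser; the function name parse_fasta and the two toy records are invented for this illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new record begins; store the previous one, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)      # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# two made-up records, only to show the layout
example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2 another hypothetical protein
MSDNE
"""

records = parse_fasta(example)
print(len(records))          # same count that grep -c '>' would report
```

For a real file, the text would come from open("yuk.fasta").read(); counting the records then gives the same number as the grep command above.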

5.2.2 Sequence composition and inferred properties

pepstats -outfile yuk.pepstats yuk.fasta

Have a look at the results:

less yuk.pepstats

Some predictions are more reliable than others . . .

The molecular weight should be accurate, except that it does not take into account post-translational modifications (cleavage, glycosylation, acylation). The absorbance at 280 nm should be accurate to within a few percent.

The isoelectric point should be a reasonable approximation, whereas the “Improbability of expression in inclusion bodies” looks a bit funny at 3 significant digits.
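To get a feeling for what "sequence composition" means, we can count residues ourselves. The sketch below (Python 3) is not the algorithm used by pepstats; it only tallies the amino acids and compares the number of residues listed as positively and negatively charged in table 5.1. The function names are made up for this example.

```python
from collections import Counter

POSITIVE = "KR"   # lysine, arginine: charge + in table 5.1
NEGATIVE = "DE"   # aspartic acid, glutamic acid: charge - in table 5.1


def composition(seq):
    """Count how often each single-letter residue occurs."""
    return Counter(seq.upper())


def charged_residue_balance(seq):
    """Number of positive minus number of negative residues.

    This is a crude count, not a pH-dependent charge calculation.
    """
    counts = composition(seq)
    positive = sum(counts[aa] for aa in POSITIVE)
    negative = sum(counts[aa] for aa in NEGATIVE)
    return positive - negative


print(composition("MKKDE"))
print(charged_residue_balance("MKKDE"))
```

A proper isoelectric point estimate additionally needs pKa values for the side chains and the termini, which is why this count is only a rough indicator.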


5.2.3 Secondary structure prediction

• α-helix: More sterically hindered, preferred by aa with smaller side chains
• β-sheet: More room, preferred by aa with side chain bulk close to the backbone

The standard amino acids and some of their properties, including preference for α-helix or β-sheet structure, are listed in table 5.1.

5.2.4 Secondary structure prediction ctd.

Find a program to use:

wossname secondary

Read its documentation:

tfm garnier

Run it:

garnier -outfile yuk.garnier yuk.fasta

Examine the output:

less yuk.garnier

The EMBOSS suite comes with a whimsically named utility, wossname, which searches the documentation of all EMBOSS programs for a keyword, and lists all programs that contain it. The tfm utility displays the full documentation for any EMBOSS program.

5.2.5 Searching for sequence motifs

The concept of sequence motifs applies to both nucleic acid and protein sequences. Wecan distinguish

• Structural motifs (for example combinations of secondary structure elements)• Functional motifs: Binding sites, target sites of enzyme action

The concept of sequence homology is the foundation of pretty much everything else in sequence analysis. Simply put, functionally similar genes and proteins should have similar sequences, the more so if the source organisms are phylogenetically related.


Indeed, the extent of sequence homology between genes or genomes is currently the gold standard for establishing phylogenetic relationships.

The concept of structural motifs overlaps with that of functional motifs, so let’s not try too hard to draw an artificial line. However, structural motifs do not necessarily imply a high degree of sequence similarity. Instead, they may simply consist of clusters of amino acids with similar preference for a given secondary structure (α-helices or β-sheets, respectively; table 5.1), or they may combine a succession of helical and sheet motifs into a higher order structure.

Functional sites often require some structural context, for example they must be exposed on the surface of the protein molecule in order to be accessible – so prediction based on sequence will generate some false positives.

An exhaustive collection of functional and structural protein motifs is maintained in the Prosite database, which also points to proteins that contain the sites in question.

5.2.6 Sequence motifs are expressed as consensus motifs

An example: The consensus motif for active sites of serine proteases

less serineprotease

The characters [LIVM]-[ST]-A-[STAG]-H-C mean: A leucine, isoleucine, valine or methionine, followed by serine or threonine, followed by alanine, . . .

This consensus motif is described in the syntax that is used in the Prosite database, and is also understood by the fuzzpro program (see below).
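If you already know regular expressions, it may help to see how such a pattern maps onto one. The sketch below (Python 3) translates only a small subset of the Prosite syntax: single letters, bracketed alternatives, x for any residue, braces for excluded residues, and repeat counts such as x(1,5). The function name prosite_to_regex is made up, and this is not how fuzzpro works internally.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a regular expression."""
    pieces = []
    for element in pattern.strip().split("-"):
        # split an element such as "X(1,5)" into "X" and the counts 1 and 5
        match = re.fullmatch(r"(.*?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.upper() == "X":
            piece = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"      # any residue except these
        else:
            piece = core                         # letter or [..] alternatives
        if low is not None:
            if high is not None:
                piece += "{%s,%s}" % (low, high)
            else:
                piece += "{%s}" % low
        pieces.append(piece)
    return "".join(pieces)


serine_protease = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
print(serine_protease)
print(re.search(serine_protease, "GGLTASHCGG") is not None)
```

The toy sequence GGLTASHCGG is invented; it merely contains one stretch that fits the consensus.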

5.2.7 How do we find motifs?

List suitable programs:

wossname motif

In this list: fuzzpro – the pattern syntax in file serineprotease and a few others (downloaded from Prosite) is the one expected by fuzzpro.

In theory, we could run

fuzzpro

But that would require us to type the longish motifs. We don’t want that. Enter shell scripting:

less fprun

The fprun script simply takes the name of a file that contains a search pattern, reads the file content and constructs the full fuzzpro command for us.

Inside the script, reading the file contents is done with cat, and the output of the cat command is captured with the backticks. We can apply the same steps directly:

cat serineprotease


Insert the file content into a command using backticks:

fuzzpro -sequence yuk.fasta -pattern `cat serineprotease` -outfile jnk
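The same "read a file, paste its content into a command" step can also be written in Python, which we will meet in a later chapter. This sketch (Python 3) only builds the argument list and does not run fuzzpro. The function name is made up, and the output file name <pattern file>_results is an assumption based on the efhand_results file mentioned below, not a statement about what fprun actually contains.

```python
import os
import tempfile


def build_fuzzpro_command(pattern_file, sequence_file="yuk.fasta"):
    """Read the pattern from a file and assemble a fuzzpro command line."""
    with open(pattern_file) as handle:
        pattern = handle.read().strip()          # what `cat file` would return
    outfile = os.path.basename(pattern_file) + "_results"   # assumed naming
    return ["fuzzpro", "-sequence", sequence_file,
            "-pattern", pattern, "-outfile", outfile]


# demonstrate with a temporary pattern file
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "cholesterol")
    with open(path, "w") as handle:
        handle.write("[LV]-X(1,5)-Y-X(1,5)-[KR]\n")
    command = build_fuzzpro_command(path)

print(" ".join(command))
```

With EMBOSS installed, the list could be handed to subprocess.call(command) to actually run the search.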

5.2.8 Searching sequence motifs

Run fuzzpro via fprun:

./fprun efhand

Inspect results:

less efhand_results

Try the same with the motif file shortchain.

Search for putative cholesterol binding motifs:

echo "[LV]-X(1,5)-Y-X(1,5)-[KR]" > cholesterol

./fprun cholesterol

5.2.9 The CAAX box motif

• Causes C-terminal farnesylation (attachment of hydrophobic moiety)
• Farnesylated proteins stick to membranes
• Cysteine, two aliphatics, one arbitrary, then end (C-terminus)

How to search for it? The difficulty here is that the CAAX box is supposed to be located at the C-terminus. I have not found a way to instruct fuzzpro to limit the search by location.

Another program for motif search is preg. This program uses a more powerful syntaxthat also allows us to specify location. We can use it like so to search for the CAAX box:

In this command, [ILV] represents any of I, L or V (aliphatics), the 2 between braces denotes 2 of the foregoing, and A-Z denotes any letter (X). The dollar sign represents the end of the sequence. Thus, we will only capture the CAAX pattern if it runs right up to the C terminus.
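The pattern elements just described are ordinary regular expression syntax, so we can try them out in Python as well (Python 3 here). This is a sketch with the re module, not the preg command itself; the function name and the short test sequences are invented for the illustration.

```python
import re

# cysteine, two aliphatics (I, L or V), any letter, then end of sequence
CAAX = re.compile(r"C[ILV]{2}[A-Z]$")


def has_caax_box(sequence):
    """True if the sequence ends with a CAAX box as defined above."""
    return CAAX.search(sequence.strip().upper()) is not None


for seq in ["MKTACVLS", "MCVLSKT", "MKTACAAS"]:
    print(seq, has_caax_box(seq))
```

Only the first toy sequence qualifies: in the second, the motif is not at the C-terminus, and in the third, the two middle residues are not among I, L and V.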

5.2.10 Comparing sequences

wossname compare

Hm. Widen the search a bit . . .

wossname compar

There. seqmatchall is what we want.

seqmatchall -outfile matched -wordsize 20 yuk.fasta


That will take a little while. When done, sift through the file with less. Take note of the names of some pair of matched sequences.

Here, we compare all sequences in the file against one another, and obtain all pairs in which there is one or more identical stretch of 20 or more amino acids. That involves quite a bit of data churning, and so seqmatchall takes a little while.
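The idea of an all-against-all word match can be sketched in a few lines of Python (Python 3). This is a naive illustration of the principle, not the algorithm implemented in seqmatchall; the names and the toy sequences are invented, and a short word size is used so that the example stays readable.

```python
from itertools import combinations


def words(seq, wordsize):
    """All contiguous stretches of length wordsize in seq."""
    return {seq[i:i + wordsize] for i in range(len(seq) - wordsize + 1)}


def matching_pairs(records, wordsize=20):
    """Return the pairs of names whose sequences share an identical word.

    records is a dict that maps sequence names to sequences.
    """
    word_sets = {name: words(seq, wordsize) for name, seq in records.items()}
    pairs = []
    for a, b in combinations(sorted(word_sets), 2):
        if word_sets[a] & word_sets[b]:      # non-empty intersection
            pairs.append((a, b))
    return pairs


toy = {"p1": "MKTAYIAKQR", "p2": "GGTAYIAGG", "p3": "MSDNEPLL"}
print(matching_pairs(toy, wordsize=4))
```

The number of pairs grows with the square of the number of sequences, which gives an idea of why the real program needs some time on a whole file of proteins.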

5.2.11 Aligning sequences

Extract two matched sequences:

seqret

When prompted for the sequence to read, type something like:

yuk.fasta:NP_116593.1

and then, for the output file:

1.seq

Repeat this for the second file, giving 2.seq as the file name. Then do

cat 1.seq 2.seq > both.seq

clustalw

For a more detailed examination of similarity between any two sequences, we can do a sequence alignment with clustalw. This procedure searches not just for identity but also for similarity, and it tries to arrange the two sequences in such a way that as many residues as possible are matched up with an identical or similar residue in the other molecule.
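To illustrate what "arranging two sequences so that as many residues as possible match up" involves, here is a minimal sketch of a classic global alignment score (the Needleman-Wunsch scheme) in Python 3. It is deliberately simplified and is not what clustalw does: it scores identity only (no similarity matrix), uses a flat gap penalty, and returns just the best score rather than the alignment. The scoring values are arbitrary choices for this example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b."""
    # prev holds the scores for aligning the first i-1 residues of a
    # with every prefix of b; we build the table row by row
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            gap_in_b = prev[j] + gap       # residue of a faces a gap
            gap_in_a = curr[j - 1] + gap   # residue of b faces a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Replacing the identity test with a lookup in a substitution matrix is the step that turns "identical" into "identical or similar".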


Table 5.1: Amino acids and some of their properties

Amino Acid      3-Letter  1-Letter  Polarity  Charge  SS       other

Alanine         Ala       A         nonpolar  0       α
Arginine        Arg       R         polar     +       α
Asparagine      Asn       N         polar     0       neither
Aspartic acid   Asp       D         polar     –       neither
Cysteine        Cys       C         nonpolar  0       β        disulfides
Glutamic acid   Glu       E         polar     –       α
Glutamine       Gln       Q         polar     0       α
Glycine         Gly       G         nonpolar  0       neither
Histidine       His       H         polar     0       α
Isoleucine      Ile       I         nonpolar  0       β        aliphatic
Leucine         Leu       L         nonpolar  0       α        aliphatic
Lysine          Lys       K         polar     +       α
Methionine      Met       M         nonpolar  0       α(β)
Phenylalanine   Phe       F         nonpolar  0       α(β)     aromatic
Proline         Pro       P         nonpolar  0       break
Serine          Ser       S         polar     0       neither
Threonine       Thr       T         polar     0       β
Tryptophan      Trp       W         nonpolar  0       β        aromatic
Tyrosine        Tyr       Y         polar     0       β        aromatic
Valine          Val       V         nonpolar  0       β        aliphatic


Chapter 6

Molecular docking

6.1 Introduction

• Purpose: Find binding sites for small molecules on protein receptors
• Method: Position and conformation of the ligand are randomly varied, and the binding energy is estimated for each variation
• Widely used for in silico screening of the interactions of existing or hypothetical small molecules with drug targets
• Requires crystal or NMR structures of the receptors
• Various commercial and free software implementations. We will use Vina

The receptor can be treated as rigid or conformationally flexible; the latter increases computational cost. In high-throughput applications, that is, when screening a large number of compounds, the receptor is therefore typically treated as rigid. We will adopt the same approach in our exercises.

6.1.1 Overview of the procedure

• Use Autodock tools to prepare input files for receptor and ligand
• Define search box (limit the space in which the ligand can hunt for a binding site)
• Run Autodock Vina to perform the docking
• Examine output in Pymol

We need to install Vina (http://vina.scripps.edu) and MGLtools (http://mgltools.scripps.edu).

The input files for the docking program are derived from .pdb files. The latter only contain molecular coordinates but no information about charges, and they typically also lack the hydrogen coordinates. Charges and hydrogen bonds are important in binding, so we need to supply this information. We produce it with Autodock tools.


6.2 Exercise: Docking imatinib to abl protein tyrosine kinase

This exercise is a recapitulation of a video tutorial that is available on the Vina website. A little background:

• abl kinase is a mutant tyrosine kinase that causes chronic myeloid leukemia (CML)
• Imatinib is a tyrosine kinase inhibitor that is used against leukemia and some solid tumors

• Good example of structure-based drug design

6.2.1 Preparing the receptor input file

1. In a console, cd to the tutorial folder
2. Start autodock tools: adt
3. Menu File→Read Molecule→receptor.pdb
4. Add hydrogens: Edit→hydrogens→add→polar only→OK.
5. Apply: Grid→Macromolecule→Choose→OK.

Save the resulting file as receptor.pdbqt.

It is assumed that both autodock tools and Vina have been installed and are on your shell’s PATH.

The .pdbqt file that is produced in this step assigns the charges to the basic and acidic amino acid side chains. It also adds the hydrogen coordinates.

When adding the hydrogens, we choose “polar only”. The apolar ones will then be treated by Vina by way of pseudo-atoms. For example, a methyl group is treated as if it were a single atom, with a volume that includes both the central carbon and the three hydrogens attached to it. This simplifies the calculations considerably, without sacrificing too much accuracy. In contrast, polar hydrogens (on –OH groups for example) must be treated explicitly and individually, since they can engage in hydrogen bonding.

6.2.2 Preparing the ligand input file

1. From the toolbar, execute: Ligand→input→open→drug.pdb→OK. This will load the pdb file and automatically assign a polarity to each atom.

2. Hide protein in the dashboard panel (the white area), zoom and center onto ligand. Zoom and rotation work with the mouse wheel; movement in the plane works with the right mouse key.

3. Bond rotations: From the toolbar, execute Ligand→Torsion tree→choose torsions. Bonds considered rotatable are highlighted in green, non-rotatable ones in magenta. There is one non-rotatable bond that is next to a phenyl ring; click on it to make it rotatable. Done.

4. Save: From the toolbar, execute Ligand→output→pdbqt, save file as drug.pdbqt.

When loading the ligand, hydrogen atoms get assigned automatically, so we don’t have to do it manually in this case.


6.2.3 Defining the search area

1. Toolbar – Grid→Grid box. This brings up a dialog with cheesy “thumbwheel” controls, and a cube in three colors. This cube represents the search area.

2. First, adjust the units: Turn up the “spacing” thumbwheel to 1.0 by stroking it from left to right with the mouse.

3. Adjust the center coordinates. Don’t use the thumbwheel controls for these, but the text fields; update the display with <enter> (Groan. So user friendly.)

4. Adjust the dimensions. Note that the colors of the controls and of the surfaces of the box correspond to one another.

Restricting the search area in which Vina is supposed to look for docking sites avoids a lot of unnecessary computation.

6.2.4 Create the Vina configuration file

center_x = 15
center_y = 50
center_z = 20

size_x = 30
size_y = 30
size_z = 30

receptor = receptor.pdbqt
ligand = drug.pdbqt

log = log.txt
exhaustiveness = 20
cpu = 1

Vexingly, the program does not let us save our hard work directly. So, we read the coordinates of the box from the screen and type them into a text file. While we are at it, we also add the names of the receptor and ligand files that we want to dock.

The log file will contain messages that Vina produced during its run. If all goes well, we can ignore it. The exhaustiveness parameter, here set to 20, can be given higher values for more thorough optimization. The cpu parameter specifies the number of CPUs that you want to let Vina use. If you plan on doing other stuff while Vina is running, keep at least one to yourself (for example, let Vina use 3 out of 4 available CPUs).

Save this file as vina.cfg and you are ready to run the docking.
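If you need to dock many ligands, typing these files by hand gets old quickly. As a sketch (Python 3), the configuration text can be generated by a small function. The function name vina_config is made up; the keys and default values are simply those of the file shown above.

```python
def vina_config(center, size, receptor="receptor.pdbqt", ligand="drug.pdbqt",
                log="log.txt", exhaustiveness=20, cpu=1):
    """Return the text of a Vina configuration file.

    center and size are (x, y, z) tuples describing the search box.
    """
    lines = []
    for axis, value in zip("xyz", center):
        lines.append("center_%s = %g" % (axis, value))
    lines.append("")
    for axis, value in zip("xyz", size):
        lines.append("size_%s = %g" % (axis, value))
    lines += [
        "",
        "receptor = " + receptor,
        "ligand = " + ligand,
        "",
        "log = " + log,
        "exhaustiveness = %d" % exhaustiveness,
        "cpu = %d" % cpu,
    ]
    return "\n".join(lines) + "\n"


text = vina_config(center=(15, 50, 20), size=(30, 30, 30))
print(text)
```

Writing the returned text to vina.cfg reproduces the file from this section; calling the function in a loop with different ligand names would give one configuration per compound.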

6.2.5 Run Vina

vina --config vina.cfg


This will take a while; a progress bar indicates how long the program will take to execute. After it terminates, you should have a file drug_out.pdbqt; this should contain a series of docking configurations, with the energetically most favourable one at the top.
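The estimated binding energies can be pulled out of the output file with a few lines of Python (Python 3). This sketch assumes that each docked model in the .pdbqt output is preceded by a line starting with "REMARK VINA RESULT:" whose first number is the energy; check your own output file to confirm that layout. The sample text below uses invented numbers and is not the result of an actual run.

```python
def vina_energies(pdbqt_text):
    """Collect the first number from every 'REMARK VINA RESULT:' line."""
    energies = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            fields = line.split(":", 1)[1].split()
            energies.append(float(fields[0]))
    return energies


# invented sample, only to show the assumed layout
sample = """MODEL 1
REMARK VINA RESULT:      -9.5      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:      -8.7      1.234      2.345
ENDMDL
"""

print(vina_energies(sample))
```

For a real run, the text would come from open("drug_out.pdbqt").read().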

6.2.6 Inspect the results in Pymol

• Load the file drug_out.pdbqt
• Load the receptor file and the X-ray coordinates of the ligand
• Highlight and contrast the experimental drug molecule with the one obtained by docking
• Move through the various docked conformations with the arrow buttons at the bottom right of the lower Pymol window

The best conformation should be pretty close to the one obtained by X-ray crystallography.


Chapter 7

Python programming

7.1 Introduction

Why? Let’s ask the German poet Friedrich Schiller:

Die Axt im Haus ersetzt den Zimmermann.

Translation:

An axe at home is worth a carpenter.

7.1.1 Python vs. Gnuplot or LaTeX

• All are programmable systems
• Gnuplot and LaTeX are adapted to special purposes, Python is a general purpose language – manipulates arbitrary data in arbitrary ways
• Python can be used for text processing, serving web content, number crunching, image processing – you name it

7.1.2 Is programming easy?

Yes. . .

• Simple tasks can be accomplished with simple programs
• Many code libraries are available that we can use as building blocks for our own programs with little effort

No. . .

• Programs cannot be simpler than the problems they aim to solve


• Large programs need careful and sound design in order to remain manageable

In this first session, we will go over a few elementary concepts. Nevertheless, you will see that even with very basic elements we will manage to write a program that translates a DNA sequence to a protein sequence.

7.1.3 Why Python?

• Relatively easy to learn – increasingly used as an introductory language in university classes
• Emphasis on readability and on sane, consistent syntax
• Good allround capabilities
• Many libraries available for scientific computing
• Well-designed – suitable both for small, one-off scripts and large, complex programs
• Well-behaved when program errors occur – essential, since errors happen often, particularly during development

Python is nice but it does have some idiosyncrasies. As an alternative, you might consider Ruby, which seems a bit simpler in several ways. However, it does not have the same range of libraries for scientific computing. On balance, Python is a better choice.

7.1.4 How Python programs are created and executed

1. Python code is written and saved in a simple text file named something like myprogram.py
2. We ask Python to execute it: python myprogram.py
3. Python reads the text file and translates or compiles it into an intermediate (byte code) format
4. Python executes the byte code

For creating Python code files, we can use any text editor, but it will be helpful to use one that assists us with Python-specific syntax highlighting. The editors Gedit (part of the Gnome desktop) or Kate (part of the KDE desktop) are good enough and straightforward to use. While we are at it: You should configure your editor to use tab-stops 8 spaces wide, and to insert spaces instead of tabs for indentation. Indentation is important in Python, and these settings are most widely used and recommended.

The execution model described above makes Python an interpreted language, as opposed to e.g. C or Fortran, which are compiled languages. One practical consequence is that we always need to have Python installed not only to develop but also to run Python code. Another consequence is that Python code often runs slower than C or Fortran.

In certain situations, Python will store the byte-code format in a separate file, such as myprogram.pyc; it will update such files as needed, so we can just leave them alone.
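Steps 3 and 4 normally happen behind the scenes, but Python lets us perform them by hand, which makes the two-stage model a little more tangible. A minimal sketch, written for Python 3 (the file name "<example>" is just a label that would appear in error messages):

```python
source = "result = 2 + 3\n"

# step 3: translate the program text into a byte code object
code = compile(source, "<example>", "exec")

# step 4: execute the byte code; variables end up in the namespace dict
namespace = {}
exec(code, namespace)

print(namespace["result"])
print(type(code.co_code))    # the raw byte code is stored as bytes
```

In everyday use you never need to call compile yourself; running python myprogram.py does both steps for you.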


7.2 First steps

Bring up your text editor and type:

print "Hello, world!"

Save the file as hello.py and run it in a console:

python hello.py

Instead of calling python ourselves, we can also tell bash to do it for us, by inserting the so-called “hash-bang” line at the very top of hello.py:

#!/usr/bin/python

After making it executable (chmod +x hello.py), we can then invoke the file like this:

./hello.py

If we insert the hash-bang line and store the program in some folder on the PATH, we can invoke it from anywhere, just like gnuplot or any other script or program.

7.2.1 Python’s interactive mode

If you just type

python

you will see something like this:

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

Type print 2 + 3 and enter. The code is executed immediately, and the result is displayed.

Type <ctrl-d> to exit.

The interactive mode can be useful for trying things out, but for anything that is longer than one or two lines you want to save the code as a file, so that you can change it and observe the effect of your changes.

7.2.2 Naming pieces of data: Variables

a = 2
b = 3

c = a + b
print c


d = c**a
print d

firstname = 'john'
lastname = 'doe'

print firstname + ' ' + lastname

Variables are essential in any kind of programming language. They can be made up from thin air and used right away; this works much in the same way as we have used them in Gnuplot. Variable names must start with a letter or underscore and can be continued with letters, underscores and digits. For example, joe and _joe_123 are valid names, while 123joe is not. Also note that variable names, or in fact all names in Python, are case-sensitive – to Python, joe is distinct from Joe and from JOE.

You can use almost any word to name a variable. It is advisable to use meaningful variable names – a variable name should reflect the meaning of the data it contains. Choosing good names can make a big difference to the readability of your code. This also goes for the names of functions or objects (see later).
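The naming rules can be checked from within Python itself. This small sketch assumes Python 3, where strings have an isidentifier method (it is not available in the Python 2.6 shown earlier); it tests only the form of a name, not whether the name is reserved.

```python
# does each candidate have the form of a valid name?
candidates = ["joe", "_joe_123", "123joe"]
validity = {name: name.isidentifier() for name in candidates}
print(validity)

# names are case-sensitive: these are three separate variables
joe = 1
Joe = 2
JOE = 3
print(joe, Joe, JOE)
```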

7.3 Keywords and builtins

7.3.1 Some names are special

If we try:

class = 'destroyer'

we get

  File "<stdin>", line 1
    class = 'destroyer'
          ^
SyntaxError: invalid syntax

This is because class is a Python keyword – a word that is reserved by Python for its own use.

7.3.2 Python keywords

and       as        assert    break
class     continue  def       del
elif      else      except    exec
finally   for       from      global
if        import    in        is
lambda    not       or        pass
print     raise     return    try
while     with      yield

. . . not meant for immediate memorization. Your editor should highlight them in bold face or in some special color. These words have a fixed meaning and cannot be used in any other way. If you try, Python will flag a syntax error, just as in the example above.
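Rather than memorizing the table, we can ask Python itself. The following is a minimal sketch using the keyword module from the standard library; the candidate names in the list are made up for illustration. It is written with print(...) as a function call so that it runs under Python 2 and Python 3 alike. The number of keywords differs between versions (for example, print and exec are keywords only in Python 2), so no expected count is given.

```python
import keyword

# kwlist holds the reserved words of whichever interpreter runs this
print(len(keyword.kwlist))

for name in ['class', 'klass', 'lambda', 'joe']:
    # iskeyword returns True for reserved words, False otherwise
    print(name.ljust(8) + str(keyword.iskeyword(name)))
```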

7.3.3 Python built-ins whose names are not protected

dir(__builtins__)
print reload
reload = 55
print reload

Save this file as junk.py and run

pychecker junk.py

If you inadvertently rebind the name of a built-in, your own code may work fine, but only as long as the code in the libraries you may be using does not depend on the original meaning of that name.

I fail to see the benefit of this – flexibility is good, but allowing the user to clobber built-in names is overdoing it. So, watch out for it. Happily, Pychecker helps you with that – so it is a good idea to run it over your code files, particularly while you are still learning Python.

Good editors should also recognize built-in names and colorize them as part of the syntax highlighting. If a name you chose unexpectedly changes color, modify it until it doesn’t.

7.4 Data types

Try out the following in an interactive python prompt:

a = 5
b = 3

c = 'john'
d = 'doe'

print a + b    # 8
print c + d    # 'johndoe'
print a + c    # throws TypeError

print "type(a):", type(a)    # <type 'int'>
print "type(c):", type(c)    # <type 'str'>

Variable names can be made up from thin air and used right away; this works in the same way as we have seen in Gnuplot.

Variable values have types that determine what operations can be applied to them. In the above example, the + operator effects addition of numbers, and concatenation of strings (words). If we try to apply + to a string and a number, Python determines that the two operands are of different type and that the + operation between them is not defined. It therefore “throws” a TypeError exception.
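If we do want to combine a number and a string, we have to say which operation we mean by converting one operand explicitly. The sketch below uses the built-in str and int functions; the try ... except construct it relies on is explained further down in these notes, and the variable names label and total are chosen just for this illustration.

```python
a = 5
c = 'john'

try:
    a + c                      # int + str is not defined
except TypeError as err:
    print('TypeError: ' + str(err))

# explicit conversion makes the intent unambiguous
label = c + str(a)             # concatenation of two strings
total = a + int('3')           # addition of two numbers
print(label)                   # john5
print(total)                   # 8
```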

7.5 Working with more data: Containers

Names of individual pieces of data are useful, but we also need to work with larger and variable amounts of data. For this, we use containers. The most important containers are lists and dictionaries. A list can be created as follows:

7.5.1 Lists

a = [1, 2, 3, 4, 5, 6, 7]
print a

a is a list.

a.append('joe')
print a

a.pop()
print a

del a[5]
print a

a is mutable – we can change its contents in place. Lists can contain arbitrary objects – numbers, first names, other lists, whatever – alone or in combination.


7.5.2 How variables work with mutable objects such as lists

a = range(5)
print a         # a is a list

b = a           # b and a now point to the same list
print b

c = list(a)     # c is a copy of a
print c

a.pop()

print a
print b
print c

Assignments like b = a create a second reference to the same object – if we change the object through one reference, we will see the change through the other one also.

As in Gnuplot, we can insert comments into Python code. Anything that follows a # character on a line is ignored. You should get into the habit of inserting comments into your code that explain what the code is supposed to be doing, or why you chose this particular way of doing things over another.

7.5.3 Testing for identity and equality

a = range(5)    # create a list
b = a           # create another reference to it
c = list(a)     # create a copy of the list

print 'a equals b?', a == b
print 'a equals c?', a == c

print 'a same data as b?', a is b
print 'a same data as c?', a is c

print id(a)
print id(b)
print id(c)

As we have just seen, with mutable objects such as lists, the distinction of equality and identity becomes important. Identity can be tested for with the is keyword, whereas equality is tested with the == operator. Identity implies equality, but not vice versa.

Notice the difference between assignment (a = 5) and comparison (a == 5). Assignment sets a variable to a new value, whereas comparison tests the current value for equality to another.


The difference between equality (==) and identity (is) is as follows: Two lists (or other objects) are equal if they contain the same value or values. They are identical only if they are one and the same, stored at the same location in memory. If we create a new list by copying an existing one, for example with the list function, the original and the copy will be equal, but not identical, since the copy is stored in a new place in memory.

By default, Python does not create a copy – an assignment like b = a will always simply give us a new reference to the same data. If we need a copy, we must create it explicitly.
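One subtlety is worth a sketch: copying with list() duplicates only the outer list. If the list contains other mutable objects, those are still shared. The example below contrasts a plain reference, a copy made with list(), and a copy made with copy.deepcopy from the standard library; the variable names alias, shallow and deep are chosen for illustration only.

```python
import copy

a = [[1, 2], [3, 4]]        # a list that contains other lists

alias = a                   # second reference, no copy at all
shallow = list(a)           # new outer list, inner lists still shared
deep = copy.deepcopy(a)     # new outer list and new inner lists

a[0].append(99)             # change one of the inner lists in place

print(alias[0])             # [1, 2, 99]
print(shallow[0])           # [1, 2, 99]
print(deep[0])              # [1, 2]
print(shallow is a)         # False
print(shallow[0] is a[0])   # True
```

For flat lists of numbers or strings, list(a) or a[:] is all we need; deepcopy only matters once containers are nested.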

7.5.4 List slicing

a = range(10)
print a

b = a[0:5]
print b

c = a[5:-1]
print c

d = a[:]
print d

e = a[::2]
print e

List slicing is an efficient way to extract parts of a list. Slicing also works with strings (see later).
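For reference, here is the same kind of slicing again with the results spelled out. This is a sketch written to run under both Python 2 and Python 3, which is why range is wrapped in list() and print is called as a function.

```python
a = list(range(10))     # [0, 1, 2, ..., 9]

print(a[0:5])           # [0, 1, 2, 3, 4]   start included, stop excluded
print(a[5:-1])          # [5, 6, 7, 8]      -1 is the last item, which is excluded
print(a[5:])            # [5, 6, 7, 8, 9]   omit the stop to run to the end
print(a[::2])           # [0, 2, 4, 6, 8]   every second item
print(a[::-1])          # a reversed copy
print('atgtat'[0:3])    # atg               strings slice the same way
```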

7.5.5 Tuples

a = (1,2,3,4)       # make a tuple from scratch
print type(a)

b = (1)             # does NOT make a tuple
print b, type(b)

c = (1,)            # make a tuple with one element
print c, type(c)

print a[1:3]        # tuples can be sliced
a.append(5)         # fails - tuples are immutable

b = list(a)         # make a list from a tuple
c = tuple(b)        # make a tuple from a list


Tuples are similar to lists but immutable – you can’t add elements to tuples or remove them. There are a few cases in which tuples must be used instead of lists; we will get to that.

7.5.6 Sets

a = 'john doe was born in 1806'.split()
print a           # a is now a list of strings
sa = set(a)       # create a set from a
print sa

b = list(a)       # copy a
b.reverse()       # reverse order of elements
print b

sb = set(b)
print sb

c = a * 2         # merge two copies of a

sc = set(c)
print sc

print sa == sb == sc

The first line illustrates the split method of string objects. Strings are just pieces of literal text. In Python, they are objects, that is, they contain both data (the text itself) and code that we can use to operate on those same data. A unit of code that is attached to an object is called a method. We will get back to this topic later.

The remainder of this example illustrates the key properties of sets:

1. Order does not matter, and
2. each element occurs only once.

Sets are useful if you want to ensure that each single piece of data in a collection is unique.
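Beyond removing duplicates, sets support fast membership tests and the usual set operations. A short sketch follows; the residue names are made-up example data, and sorted() is used only to get a predictable order for printing.

```python
residues = ['ALA', 'GLY', 'ALA', 'SER', 'GLY', 'ALA']   # made-up example data

unique = set(residues)
print(len(residues))              # 6
print(len(unique))                # 3
print('SER' in unique)            # True

other = set(['SER', 'THR'])
print(sorted(unique & other))     # ['SER']                        intersection
print(sorted(unique | other))     # ['ALA', 'GLY', 'SER', 'THR']   union
print(sorted(unique - other))     # ['ALA', 'GLY']                 difference
```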

7.5.7 Dictionaries

phonenumbers = {'john': 911, 'jane': 119, 'bill': 191}

print phonenumbers['john']      # look up john's phone number

phonenumbers['jim'] = 919       # assign new value to new key
print phonenumbers

phonenumbers['john'] = 119      # assigning to an existing key
print phonenumbers              # overwrites the previous value

print phonenumbers.keys()       # get a list of all keys
print phonenumbers.values()     # get a list of all values
print phonenumbers.items()      # list of all key-value pairs

Like lists, dictionaries are a real workhorse and are used all over the place in Python programs. Dictionaries, or dicts for short, let us connect arbitrary pieces of data to one another. The example above shows that each key can only occur once – if we assign a new value to an existing key, the old value is forgotten.

In contrast, values need not be unique. In our example, John moved in with Jane, and the same phone number was listed with both their names.
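What happens if we look up a key that is not there? The sketch below shows the three common ways of dealing with this: catching the KeyError, testing with the in keyword, and using the get method with an optional fallback value. The fallback 0 is an arbitrary choice for this illustration.

```python
phonenumbers = {'john': 911, 'jane': 119}

try:
    phonenumbers['bill']                # missing key raises KeyError
except KeyError as err:
    print('no entry for ' + str(err))

print('bill' in phonenumbers)           # False
print(phonenumbers.get('bill'))         # None
print(phonenumbers.get('bill', 0))      # 0
print(phonenumbers.get('john', 0))      # 911
```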

7.5.8 Tuples vs. lists as dictionary keys

phones = {('joe','home') : 119,
          ('joe','work') : 191,
          ('jane','home') : 119,
          ('jane','work') : 911}

phones = {                      # this won't work
    ['joe','home'] : 119,
}

If we need to combine several pieces of data into one dictionary key, we must use tuples. This is one place where lists simply won’t work.
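The following sketch shows what the failure looks like and how to get around it when the data arrive as a list: convert the list to a tuple first. The try ... except construct is covered later in these notes.

```python
phones = {('joe', 'home'): 119, ('joe', 'work'): 191}
print(phones[('joe', 'work')])      # 191
print(phones['joe', 'home'])        # 119, the parentheses are optional here

try:
    bad = {['joe', 'home']: 119}    # a list cannot serve as a key
except TypeError as err:
    print('TypeError: ' + str(err))

key = tuple(['jane', 'home'])       # convert the list to a tuple first
phones[key] = 119
print(len(phones))                  # 3
```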

7.6 Repeated execution: Loops

a = range(10, 30, 5)    # [10, 15, 20, 25]

for x in a:             # x adopts the value of each
    print x             # item in the list in turn

y = 10
while y > 0:            # condition is tested before
    print y             # each run of the loop
    y = y - 1           # decrement y


In all previous examples, each piece of code was executed only once. Sometimes, however, we want a line, or a block of code to execute repeatedly, for example to apply some operation to each item in a list or dictionary. This is where loops come in.

There are two loop constructs: The for loop and the while loop. The for loop is well suited for iterating over a container. The while loop works differently – it is controlled by a condition that is evaluated before each run of the loop. If we do not want the loop to continue until doomsday, the code inside the loop must change a variable so that the controlling condition at some point becomes false.

The code that is supposed to be inside the loop, meaning that it should be executed in toto for each loop iteration, is indicated by indentation. Be sure to indent each line to the exact same extent – by 4 empty spaces exactly.

7.6.1 Iterating over a dictionary

from gencode import genetic_code
print genetic_code

for codon in genetic_code:
    amino_acid = genetic_code[codon]
    print codon, amino_acid

inverted = {}
for codon, aa in genetic_code.items():
    inverted[aa] = codon

print inverted
print len(genetic_code), len(inverted)

A new concept: Importing code from other files, or modules. In this case, we simply import a dictionary that was defined in the file gencode.py. Note that in the import statement we omit the .py extension.

There are several ways to iterate over a dictionary; the second one shown here is particularly useful. The len (for length) function counts the items in the two dictionaries. Can you figure out why the inverted dictionary is shorter than genetic_code?
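Once you have worked out the answer, here is one possible way to invert a dictionary without losing anything: store a list of keys under each value. This is only a sketch. Since gencode.py is not reproduced in these notes, it uses a hypothetical four-entry excerpt and assumes that the values are one-letter amino acid codes; the name codons_for is made up as well.

```python
# hypothetical four-entry excerpt; the real genetic_code has 64 entries
genetic_code = {'ATG': 'M', 'TGG': 'W', 'GCT': 'A', 'GCC': 'A'}

codons_for = {}
for codon, aa in genetic_code.items():
    # setdefault returns the list already stored under aa,
    # or stores a new empty list there and returns that
    codons_for.setdefault(aa, []).append(codon)

for aa in sorted(codons_for):
    print(aa + ' ' + ' '.join(sorted(codons_for[aa])))

print(len(genetic_code))    # 4
print(len(codons_for))      # 3
```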

7.6.2 Iterating over strings

from string import ascii_lowercase
print ascii_lowercase    # 'abcd...'

for character in ascii_lowercase:
    print character

length = len(ascii_lowercase)

for i in range(length):
    print i, ascii_lowercase[i]

for i, c in enumerate(ascii_lowercase):
    print i, c

Iterating over a string (that is, using a string as if it were a container inside a for loop) gives us one character in each run of the loop. The second example shows that we can also index into a string by numbers. Finally, the third example shows how to obtain the running number as well as the character without first defining a list of numbers: The enumerate function does it for us. The enumerate function also works with lists or tuples.

7.6.3 More fun with strings

s = '''atgtatacta aaaattttag taattccaga
atggaagtaa aaggtaataa cggctgttct'''

fragments = s.split()
print fragments

joined = "".join(fragments)
print joined

print joined[0:3], joined[3:6], joined[6:9]

codons = []
for i in range(0, len(joined), 3):    # step through joined in threes
    codon = joined[i:i+3]
    print i, codon
    codons.append(codon)

print codons

Note the triple-quote syntax that allows us to define strings that span more than one line. It is also useful if the string itself contains single quote characters. Instead of triple single quotes, we can also use triple double quotes.

You should run this code and see what it does, and make sure you understand how it comes about. Also try '-'.join(fragments) to understand what the .join method does.

7.6.4 Exercise: Translating DNA to protein

1. Use the genetic_code dictionary to build a program that translates the DNA sequence into the corresponding protein sequence.

2. Build a program that reverse-translates the protein sequence back to a DNA sequence encoding it.

We now have all the tools to translate a DNA sequence into protein, and back again. Can you figure it out?

7.7 List comprehensions

. . . can be used to quickly create a list by iterating over another container. Example: Collect all keys from the genetic_code dictionary and convert them to lowercase.

from gencode import genetic_code

# use a loop
keys = []
for key in genetic_code:
    keys.append(key.lower())

# now use a list comprehension
keys2 = [x.lower() for x in genetic_code]

assert keys == keys2    # confirm that both are equal

List comprehensions offer a more concise syntax for simple loops. Where more complex operations or conditions are involved, it is more readable to write the loops explicitly.

Note the somewhat backward looking syntax of the list comprehension. In this case, we use the expression x.lower() before we have even assigned a value to x. In any other context, this would cause an error: NameError: name 'x' is not defined. Here, this works because internally Python translates a list comprehension to a loop construct like the one above.

The assert statement ensures that some condition is fulfilled, which can be helpful particularly in development. Try: assert 1==2, 'numbers unequal'
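A list comprehension can also carry a condition, which filters the items as they are collected. The sketch below compares an explicit loop to the equivalent comprehension; the codon list is made-up example data, and the if statement used in the loop is explained in section 7.9.

```python
codons = ['ATG', 'TAA', 'GCT', 'TGA', 'TGG']    # made-up example data
stop_codons = ('TAA', 'TAG', 'TGA')

# explicit loop with a condition
coding = []
for c in codons:
    if c not in stop_codons:
        coding.append(c.lower())

# same result: the expression first, then the for part, then the if part
coding2 = [c.lower() for c in codons if c not in stop_codons]

assert coding == coding2, 'loop and comprehension disagree'
print(coding2)    # ['atg', 'gct', 'tgg']
```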

7.7.1 Exercise: Use a list comprehension to translate a DNA sequence

from gencode import genetic_code
from random import shuffle

DNA = genetic_code.keys()
shuffle(DNA)
print DNA

translated = []
for codon in DNA:
    translated.append(genetic_code[codon])

translated2 = [...] ?
assert translated == translated2, 'you goofed'

Solution: translated2 = [genetic_code[codon] for codon in DNA]. Again, we use the codon variable before assigning it a value.

7.8 Nested containers and loops

7.8.1 Nested containers

from atoms import atom_names
import pprint    # pretty printing of data structures

pprint.pprint(atom_names, width=100)

pprint.pprint(atom_names.items(), width=100)

The pprint module exists in the Python standard library. It is not part of the language itself, so the code has to be imported. However, it is available on every Python installation. The Python standard library is very rich and provides solutions for many problems. For serious use of Python, it is important to become acquainted with it. Documentation is available on-line at http://docs.python.org/library/.

The atom_names dictionary specifies the names of atoms that occur in the amino acid residues in pdb files. In this dictionary, each value is a list. If we extract the items from this dictionary, the lists wind up inside the tuples, which in turn are the members of a list.

We can nest containers to arbitrary depth. By the way, Python lists are one-dimensional; we can mimic multi-dimensional arrays using nested lists. For numerical heavy lifting though, it is preferable to use the numpy package, which introduces proper array and matrix data types. Numpy is not part of the standard library but is available through your package manager – as are many other useful and powerful libraries.
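As a small illustration of mimicking a two-dimensional array, here is a sketch that uses a list of rows, each row being a list itself. The name grid and its numbers are arbitrary.

```python
# a 3 x 3 grid as a list of rows; each row is itself a list
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

print(grid[1])          # [4, 5, 6]     one whole row
print(grid[1][2])       # 6             row 1, then column 2

grid[0][0] = 100        # the inner lists are mutable, too
column = [row[0] for row in grid]
print(column)           # [100, 4, 7]   first item of every row
```

Extracting a column takes a loop or a list comprehension, since the nested list knows nothing about columns; this is the kind of bookkeeping that numpy arrays take off our hands.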

7.8.2 Nested loops

Nested containers often go with nested loops. Example: For each amino acid, extract the element (the first letter) from each atom name, and construct a new dictionary.

from atoms import atom_names
from pprint import pprint

elements = {}

for aa, names in atom_names.items():
    elm = []
    for name in names:
        elm.append(name[0])
    elements[aa] = elm

pprint(elements)

Note the nested indentation that defines what lines are repeated in each outer loop or inner loop, respectively.

7.8.3 Rewrite this using list comprehensions?

elements2 = {}
for aa, names in atom_names.items():
    elements2[aa] = [name[0] for name in names]

elements3 = dict([(aa,[name[0] for name in names]) \
                  for aa, names in atom_names.items()])

assert elements == elements2 == elements3

While the elements3 version is the shortest, the elements2 version is simpler and clearer. Terse one-liners can be nice as brain teasers but are not really recommended for real programs – readability beats terseness.


7.9 Conditional execution

firstname = "Joe"

if firstname in ("Joe","James","Jim"):
    greeting = "Sir:"
elif firstname in ("Joan", "Jane", "Jill"):
    greeting = "Madam:"
else:
    greeting = "Sir/Madam:"

print "Dear", greeting

Conditional execution is an important building block of almost any program. The if clause takes an expression that it subjects to boolean evaluation, that is, it determines whether this expression is true or false. If it is true, the code controlled by the if clause executes, otherwise it doesn’t. This conditional code is indicated by indentation; in the above example, it is just one line (greeting = "Sir:").

The elif clause takes effect only if the preceding if clause did not apply. Again, it requires an expression that is tested for truth or falsehood.

Finally, the else clause will only take effect if all preceding clauses failed. Unlike the if and the elif clauses, the else clause does not take any test expression; once it is reached, it executes unconditionally.

Only the if clause is required. There can be any number of elif clauses, which will be evaluated in order. The else clause is optional, too; it can occur only once.

7.9.1 Conditional execution inside a loop

for i in range(1,11):
    if i % 3 == 0:    # % - remainder of division
        print i, 'divisible by 3'
    elif i % 2 == 0:
        print i, 'divisible by 2'
    else:
        print i, 'divisible by neither 2 nor 3'

Conditional execution very often occurs inside loops, such that the if clause is controlled by one or more variables that change in each loop iteration.


7.10 Boolean evaluation of expressions

for thing in [0, 1, -1, 0.0001, [], [1,2], [0], \
              {}, {0:0}, 'joe', '', None, False, True]:
    print str(thing).ljust(10), bool(thing)

print False, True, False and True    # the and operator
print False, True, False or True     # the or operator

print True, not True                 # the not operator
print False, not False

print (False and False) or True      # 'and' or 'or' - which
print False and (False or True)      # one has precedence?
print False and False or True

Containers or strings that are not empty, and numbers that are not zero evaluate to True in a boolean context. Boolean evaluation is implicitly performed by if...elif...else statements and in while loops. We can also explicitly invoke it with the built-in bool function; for example, bool(-3) returns True.
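In practice, this rule is most often used to test whether a container has anything in it. The sketch below shows the idiom, together with a pitfall: a string is judged by whether it is empty, not by what it says. The list name atoms and its content are made up for illustration.

```python
atoms = []                      # made-up example data
if not atoms:                   # an empty list evaluates to False
    print('no atoms found')

atoms.append('CA')
if atoms:                       # a non-empty list evaluates to True
    print(str(len(atoms)) + ' atom(s) found')

# caution: a string is judged by its length, not by its content
print(bool('0'))                # True
print(bool('False'))            # True
print(bool(''))                 # False
print(bool([0]))                # True, the list has one item
```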

The False and True values used in the and...or examples are dummies. In real life, and and or would be used like so:


if (firstname == 'Joe' or firstname == 'Jim') and age > 15:
    greeting = "Sir:"

The two == statements and the > are then evaluated to True or False, respectively, and the results of that evaluation are combined with and and or.
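Two properties of and and or deserve a sketch. First, and binds more tightly than or, so the parentheses in the example above are not decoration. Second, both operators stop evaluating as soon as the outcome is settled, which lets us guard a risky test with a safe one. The values of firstname, age and values below are made up to bring out the difference.

```python
firstname = 'Joe'
age = 12

# 'and' binds more tightly than 'or', so the parentheses matter
loose = firstname == 'Joe' or firstname == 'Jim' and age > 15
strict = (firstname == 'Joe' or firstname == 'Jim') and age > 15
print(loose)      # True
print(strict)     # False

# 'and' stops at the first false operand, so values[0] is never
# attempted on the empty list and no IndexError occurs
values = []
if values and values[0] > 0:
    verdict = 'first value is positive'
else:
    verdict = 'empty, or first value not positive'
print(verdict)
```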

7.10.1 Alternative formulation of conditionals

for i in range(1,11):
    if not i % 3:
        print i, 'divisible by 3'
    elif not i % 2:
        print i, 'divisible by 2'
    else:
        print i, 'divisible by neither 2 nor 3'

7.10.2 Exercise: What about 6?

One problem with the loop example in section 7.9.1 is that the number 6 gets evaluated for divisibility by 3 only, but not for 2, since the elif clause is skipped once the if clause has applied. Can we rewrite the previous example so that each number gets evaluated for divisibility by both 2 and 3?


7.11 Functions

7.11.1 Defining functions

def divisible_by(dividend, divisors):
    '''
    Tests a dividend for divisibility
    by a list of divisors. Returns
    a list with all valid divisors.
    '''
    divs = []
    for d in divisors:
        if not dividend % d:
            divs.append(d)
    return divs

for i in range(1,11):
    result = divisible_by(i, [2,3])
    print i, 'divisible by ', result

We have already used a couple of built-in functions, for example range or dir. Here we see an example of how to define functions ourselves. The function is declared with the def statement, which determines the name of the function, as well as the arguments that the function expects. Arguments in this context are data that are handed over to the function when the function is called. In this example, the function calls occur inside the loop, successively on each of the numbers from 1 to 10.

The body of the function, that is, the code contained within, is again defined by indentation. In this example, the first part of the function body is a string – a short text that explains the purpose of the function. This is not necessary but is good practice. This so-called doc-string will be displayed by python if we type (in an interactive session) help(divisible_by).

The last statement in the function is return divs. This means that the divs list that was computed inside the function should be handed back to the piece of code that called the function. In our example, the divs list computed inside the function will be assigned to the variable result that was used in the for loop, outside the function.

Note that the return statement is not mandatory. Some functions don’t produce any data and accordingly don’t return any data either. An example is the pprint function that we used above (section 7.8.1). If we say, for example, b = pprint(a), then b will be equal to None.

Another point to note is that variables that were declared inside the function will not be visible from the outside, and will cease to exist once the function has finished executing and returned control to the code outside. In the example, the variable divs is declared inside the function and exists only within it.
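The last two points can be checked directly. In the sketch below, divisible_by is repeated from above (without its doc-string) so that the example is self-contained; report is a hypothetical helper without a return statement, added only for this demonstration.

```python
def divisible_by(dividend, divisors):
    divs = []                       # local name, created anew on each call
    for d in divisors:
        if not dividend % d:
            divs.append(d)
    return divs

def report(dividend, divisors):
    # no return statement: the function hands back None
    print(str(dividend) + ' divisible by ' + str(divisible_by(dividend, divisors)))

result = divisible_by(12, [2, 3, 5])
print(result)                       # [2, 3]

nothing = report(12, [2, 3, 5])
print(nothing is None)              # True

try:
    divs                            # the local name is not visible out here
except NameError as err:
    print('NameError: ' + str(err))
```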


7.11.2 Functions with default arguments

def greet(name, greeting = 'Hello'):
    print greeting, name

greet('Joe', 'Good morning')    # prints 'Good morning Joe'
greet('Jane')                   # prints 'Hello Jane'

def greet(name='Joe', greeting='Hello'):
    print greeting, name

greet()                 # prints 'Hello Joe'
greet('Hi')             # prints 'Hello Hi'
greet(greeting='Hi')    # prints 'Hi Joe'

This example shows how to define functions with default arguments. Note in particular the last example: By default, any arguments that we supply are used in order of declaration, meaning in this case that a single argument is used as a value for name, not for greeting. If we want to supply a value for greeting but not for name, we can pass it explicitly by name, as in greet(greeting='Hi').
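One pitfall follows from combining default arguments with mutable objects (section 7.5.2): a default value is created only once, when the def statement is executed, and not on every call. The sketch below demonstrates the effect and the usual workaround of using None as the default; the function names collect and collect_safe are made up for illustration.

```python
def collect(item, bucket=[]):       # the default list is created only once
    bucket.append(item)
    return bucket

print(collect('a'))                 # ['a']
print(collect('b'))                 # ['a', 'b'], the same list was reused

def collect_safe(item, bucket=None):
    if bucket is None:              # make a fresh list on every call
        bucket = []
    bucket.append(item)
    return bucket

print(collect_safe('a'))            # ['a']
print(collect_safe('b'))            # ['b']
```

Immutable defaults such as the strings in the greet example are not affected by this.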

7.11.3 Exercise: Generating random passwords

Write the appropriate code for the following function:

def pwd(length=8):
    '''
    produce a random password consisting of
    a random sequence of any number of lowercase
    letters, but with the length defaulting to 8.
    '''

When done, try to extend the function so as to also use uppercase letters and digits. We skipped this exercise in class – but it might be fun for you to try on your own. All the necessary tools have already been provided.

7.12 Importing code

import sys                         # import the sys module
print sys.path                     # access a name defined in sys

from re import compile             # import a name from a module
r = compile('^joe$')               # use the imported name directly

from scipy.stats import poisson    # scipy is a package,
                                   # containing stats as one
                                   # of its modules

# we can also rename modules or other objects
# during import, for example to use shorter names
import some_module_with_very_long_name as sm
from some_module import very_long_name as vln

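This example illustrates several forms of the import statement: importing a whole module, importing a single name from a module or from a module inside a package, and renaming modules or other objects during import.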

7.12.1 Importing self-written code

from divisible import divisible_by
help(divisible_by)    # press 'q' to quit the help view

for i in range(1,11):
    result = divisible_by(i, [2,3])
    print i, 'divisible by ', result

When we have written a function, we often want to make it available and reusable in other code files. In this example, we save the code from section 7.11.1 into the file divisible.py. Then, we can import it into another code file, or an interactive session as shown here.

Where does Python look for code files (modules)?

import sys
print sys.path
# on my machine, shows:
# ['', '/home/mpalmer/',
# '/data/python_mine',
# '/data/python_foreign', ...

Without further action, this will work only as long as the two code files are in the same directory. To make our own Python code files (modules) available across directory boundaries, we save them in a dedicated directory that we then add to sys.path, a list that contains the names of all directories that Python will search for modules to be imported.

The recommended way of adding directories to sys.path is to set the PYTHONPATH environment variable. For example, in the .profile file in my home directory, I have the following line:

export PYTHONPATH=:/data/python_mine:/data/python_foreign


On start-up, Python reads PYTHONPATH and adds the names listed there to sys.path. In my case, I store my own code in /data/python_mine.

If, on Linux, you install Python libraries, such as scipy or matplotlib, through the package manager, then sys.path will usually be updated as required. On the other hand, if you download Python code from somewhere on the web, you have to take care of this yourself. I deal with this by saving such code in my folder /data/python_foreign, which I have included in my PYTHONPATH.
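Since sys.path is an ordinary list, it can also be extended from within a running program. This is a sketch of that alternative, not a replacement for setting PYTHONPATH: the change lasts only for the current Python session, and the folder name python_mine inside the home directory is a hypothetical example.

```python
import os
import sys

# hypothetical folder holding our own modules
extra = os.path.join(os.path.expanduser('~'), 'python_mine')

if extra not in sys.path:
    sys.path.append(extra)      # affects only the running interpreter

print(extra in sys.path)        # True
print(os.environ.get('PYTHONPATH', '(PYTHONPATH is not set)'))
```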

7.13 Exceptions

from divisible import divisible_by

r = range(5)        # [0, 1, 2, 3, 4]
print r
print divisible_by(10, r)

r = list('joe')     # ['j', 'o', 'e']
print r
print divisible_by(10, r)

Exceptions happen if Python cannot perform an instruction. This can be due to faulty syntax, in which case we get a SyntaxError, or because of faulty or missing data. In the first example here, we get a ZeroDivisionError, since we pass a 0 as the first divisor to the divisible_by function. In the second example, we get a TypeError, because we attempt a numeric operation on letters.

If an exception occurs, execution of the program stops, unless we provide code to catch the exception, that is, deal with it and arrange for the program to recover and continue.

7.13.1 Catching exceptions

from divisible import divisible_by_protected

for i in [4, 'joe', 6]:
    result = divisible_by_protected(i, [0,2,3,'jim'])
    print i, 'divisible by', result

The try-except construct allows our code to recover from bad input. This is particularly important in those parts of a program that receive data directly from the outside world. For example, a program that expects certain options and arguments as its input on the command line should have a way to fail gracefully and print a helpful message in plain English, instead of just confronting the user with a cryptic Python error traceback, that is, a printout of the code lines that were executing before the error occurred. In other cases, we might just replace the missing or faulty input with some default values, and continue.
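The definition of divisible_by_protected is not reproduced in these notes. The following is a guess at what such a function might look like, assuming that divisors which cannot be tested should simply be skipped; it is a sketch of the try-except technique, not necessarily the version used in class.

```python
def divisible_by_protected(dividend, divisors):
    '''
    Like divisible_by, but skips divisors
    that cannot be tested (assumed behaviour).
    '''
    divs = []
    for d in divisors:
        try:
            if not dividend % d:
                divs.append(d)
        except (ZeroDivisionError, TypeError):
            # zero divisor, or non-numeric divisor or dividend: skip it
            pass
    return divs

for i in [4, 'joe', 6]:
    result = divisible_by_protected(i, [0, 2, 3, 'jim'])
    print(str(i) + ' divisible by ' + str(result))
```

Note that only the two anticipated exception types are caught. A bare except: would also swallow errors that we did not foresee, such as a misspelled variable name, and make them much harder to find.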


7.14 Reading and writing files

Here is how you read the content of a file:

f = open('1X9P.pdb', 'r')       # open file in read mode
blurb = f.read()                # read the entire file in one
                                # go into the variable blurb
print len(blurb)                # number of bytes in the file

lines = blurb.splitlines()      # break it up into lines
for line in lines[:10]:         # print the first ten lines
    print line

f.seek(0)                       # 'rewind' for reading again
lines = f.readlines()           # read lines directly
for line in lines[:10]:
    print line

The open function takes the file name and the opening mode, which can be either 'r' for reading or 'w' for writing. If not given, the mode is 'r', so open('myfile') is equivalent to open('myfile', 'r').

Note that the open function does not search the sys.path list that is used by import. Files have to be in the current directory, or else the path to the file has to be explicitly specified. For example, a file in a sub-directory can be opened with open('subdir/myfile'), and one in the parent directory can be specified like open('../myfile').

The difference between the .splitlines and the .readlines method is that in the first case the linebreaks will be stripped from the end of each line, whereas they are retained with .readlines.

Since print adds a newline to each line it prints, we get an extra empty line between each two lines of text in the second example.
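To see the difference without a PDB file at hand, the sketch below first writes a small throw-away file and then reads it back both ways. The file content is made up; tempfile and os come from the standard library.

```python
import os
import tempfile

# write a small throw-away file so the example is self-contained
handle, path = tempfile.mkstemp(suffix='.txt')
os.close(handle)
f = open(path, 'w')
f.write('ATOM 1\nATOM 2\n')
f.close()

f = open(path)
stripped = f.read().splitlines()    # line breaks removed
f.seek(0)                           # rewind
retained = f.readlines()            # line breaks kept
f.close()
os.remove(path)

print(stripped)     # ['ATOM 1', 'ATOM 2']
print(retained)     # ['ATOM 1\n', 'ATOM 2\n']

# stripping the line break by hand makes the two results equal
print([line.rstrip('\n') for line in retained] == stripped)    # True
```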

7.14.1 Writing files

infile = open('1X9P.pdb')           # read mode
outfile = open('atoms.pdb','w')     # write mode

for line in infile:                 # fetch one line at a time
    if line.startswith('ATOM'):
        outfile.write(line)

infile.close()
outfile.close()


This example shows a shortcut for iterating over all the lines in a file: The for line in infile statement is equivalent to for line in infile.readlines(), except that the whole file is not read in one go, but each line is fetched separately. On today’s computers with abundant memory (my first computer had a whopping 1 MB memory!) and fast hard drives, this difference rarely matters.

When we are done with files, we should close them. If we don’t, Python will do it for us either on program exit or when the last variable that points to the open file is destroyed (for example because the function in which it was declared has returned).
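Python can also take care of closing for us in a more predictable way: a file opened in a with statement is closed as soon as the indented block is left, even if an exception occurs inside it. The sketch below repeats the filtering example in this style. It assumes Python 2.6 or later, and it works on a made-up stand-in for a PDB file inside a temporary directory, so all file names here are hypothetical.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
inname = os.path.join(workdir, 'in.pdb')        # hypothetical file names
outname = os.path.join(workdir, 'atoms.pdb')

with open(inname, 'w') as f:                    # made-up stand-in for a PDB file
    f.write('HEADER demo\nATOM 1\nHETATM 2\nATOM 3\n')

with open(inname) as infile:
    with open(outname, 'w') as outfile:
        for line in infile:
            if line.startswith('ATOM'):
                outfile.write(line)
# both files are closed at this point

print(infile.closed)        # True
print(outfile.closed)       # True

with open(outname) as f:
    kept = f.read().splitlines()
print(kept)                 # ['ATOM 1', 'ATOM 3']

shutil.rmtree(workdir)      # clean up the temporary directory
```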

7.14.2 Files and functions: Exercise

Write the appropriate code for this function:

def filter_pdb(infilename, outfilename, record='ATOM'):
    '''
    Reads PDB file <infilename> and writes each line
    that starts with <record> into <outfilename>.
    '''

1. Modify the code from the last example so that it takes an optional outfilename argument. If given, the function should write the filtered lines to a file of that name. In any case, the filtered lines should also be returned in a list.

2. Modify the function so that it takes one or more record arguments; lines should be retained if they match any of those.

Again, we skipped this in class – trying it for yourself can’t hurt.