Introduction to python for bioinformatics
prbb technical seminars
Introduction to Python for bioinformatics
Giovanni Marco Dall'Olio
Unidad de Biologia Evolutiva – CEXS
Barcelona (Spain)
Python
A programming language released in 1991 by Guido van Rossum
Used for a variety of applications, from scripting to web programming
Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat...
Lots of jokes in the documentation (it is named after Monty Python)
Python and bioinformatics
Python is widely used in bioinformatics
August 2007 survey - www.bioinformaticszen.com www.bioinformatics.org survey
Python – overall view
Learning curve: ☺☺☺☺☺ Easy to learn, yet powerful
Readability of a Python program: ☺☺☺☺☺
Community, availability of open source modules: ☺☺☺☺ (for bioinformatics, CPAN is slightly bigger)
Programming paradigms: ☺☺☺☺☺ Multi-paradigm (object oriented, structured, functional, etc.)
Execution speed: interpreted language; programmer effort matters more than computer effort
Notes:
This talk is full of tables like this
They only reflect my opinion (biologist with 3-4 years of experience)
Python – Cons
State of open source libraries for bioinformatics: there is good support, but less than for Perl and R
Execution speed: comparable to Perl, Java, Ruby, ...
SOAP libraries: SOAPpy is very old, suds is the best one
Population genetics modules: as with many other specialized modules, Perl and R are better supported
Lack of true multithreading support: a structural limit (the Global Interpreter Lock) makes it impossible to have real multithreading in Python (various workarounds exist)
= very sad = fine
Python – what makes me happy
General syntax ☺☺☺☺☺ People are forced to write programs similar to yours
☺☺☺☺☺
Quicker to write programs
☺☺☺☺☺
Object Oriented, multi-paradigm
☺☺☺☺☺ (will be explained later)
Testing support ☺☺☺☺ ''
Python – learning curve
Python's syntax is easy
So you can concentrate on algorithms and problems instead of the programming language
With Python you don't have to worry about:
  Learning strange symbols (~=, <>, eq, '\n', {}...)
  Alternative syntaxes to do the same task
  Declaring variables
  Inner structure of strings/arrays
  Low level IO, passing variables per reference/value, etc.
Example of python code
#!/usr/bin/env python
'''Some python examples'''

# example 1: a 'for' loop
for name in ('Albert', 'Aristoteles', 'Archimedes'):
    print 'hello, ', name

# example 2: Opening a file and parsing it
filehandler = open('samplefile.txt', 'r')
for line in filehandler.readlines():
    if line.startswith('>'):
        print line
    else:
        pass
Python syntax I - indentation
In python, the indentation ( = spaces at the beginning of the line) is part of the syntax.
It is used to delimit loops and conditions, instead of curly braces ({})
Example:
The first 'print' is inside the loop, while the second is outside
for name in ('Albert', 'Aristoteles', 'Dayhoff'):
    print 'hello, ', name
print 'and hello to you, too'
A quick perl/python comparison
(Python)
#!/usr/bin/env python
a = 3
if a == 3:
    print 'a is eq to 3'

(Perl)
#!/usr/bin/env perl
my $a = 3;
if ($a == 3) {
    print "a is eq to 3\n";
}

Python code is usually easier to read and contains fewer symbols (like {})
Python syntax II - simplicity
Python has a small number of syntax keywords.
There is:
  only one way to open files (no 'fopen', 'openf', etc.)
  only one way to print (no printn, printf, sprintf, sprint, etc.)
  only two ways to define loops ('for' and 'while').
Python's philosophy is about simplicity.
Your colleagues are forced to write their programs in the same way as you.
Python syntax III – declaring var
You don't need to declare variables
The type of a variable is defined the first time you assign a value to it
a = 'cacagtcaga' → a is a string
b = 133 → b is an integer
c = True → c is a boolean
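The three assignments above can be checked with the built-in type() function. This is a minimal sketch written for Python 3 (where print is a function, unlike the Python 2 syntax used in the rest of these slides); it also shows that a name can later be rebound to a value of a different type.

```python
# Variables are not declared: the type comes from the assigned value.
a = 'cacagtcaga'   # a string
b = 133            # an integer
c = True           # a boolean

type_names = [type(value).__name__ for value in (a, b, c)]
print(type_names)

# The same name can later be rebound to a value of a different type.
a = len(a)
print(type(a).__name__, a)
```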
Notes on Python's speed
Python is an interpreted language
  its speed is at the level of perl, java, etc.
  programs are slower than C, but they are faster to write
  importance of programmer effort over computer effort
Many ways to speed up python
  modules can also be written in C
  some compilers exist (PyPy)
  Google is working on an enhanced version of python (news of March 2009).
Python – programming goodies
Installation and portability: ☺☺☺☺☺ Installed by default in most Linux distributions, interpreted
IDLE / text editors: ☺☺☺☺☺ Interactive shell, ipython, many editors
Install and search new modules: ☺☺☺ easy_install, PyPI
Testing support: ☺☺☺☺☺ doctest, unittest, nose
Writing documentation: ☺☺☺☺☺
Debugging: logging, pdb
Python – installation and portability
Python comes installed by default in most GNU/Linux distributions
Mac users have an old version (2.5), but can upgrade it
On Windows, you need to download an installer from www.python.org first
Being an interpreted language, Python programs are easy to port to other platforms
PyPI (Python Package Index)
PyPI (pypi.python.org) is a repository of public, open source modules for Python
For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc.
Python – installing new modules
Modules can be automatically downloaded and installed using a tool called 'easy_install'
Examples:
easy_install -U biopython                      # install or update biopython from PyPI
easy_install --prefix ~/usr biopython          # install biopython without requiring admin privileges
easy_install biopython.tar.gz                  # install biopython from a previously downloaded tarball
easy_install http://www.biopython.org/install  # install biopython from its web site
Using python
Python can be used as an interactive shell (like R, octave, matlab, etc.) or by writing programs
(python interactive shell)
gioby@dayhoff:~$ python
>>> print 'hola'
hola
>>> range(5)
[0, 1, 2, 3, 4]
(a python program)
gioby@dayhoff:~$ cat > prog.py
print 'hola'
print range(5)
gioby@dayhoff:~$ python prog.py
hola
[0, 1, 2, 3, 4]
Python interactive shell
You can use it to run code without having to save it to a script.
It has no equivalent of the saved 'sessions' of R
Many programmers prefer to use 'ipython', an enhanced version of this shell
gioby@dayhoff:~$ python
Python 2.5.2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'hola'
hola
>>> range(5)
[0, 1, 2, 3, 4]
IPython session
gioby@dayhoff:~$ ipython
Type "copyright", "credits" or "license" for more information.
In [1]: import random
In [2]: random.choice(['ciao', 'hola', 'hello'])
Out[2]: 'hello'
In [3]: 1200 / 2
Out[3]: 600
In [4]: random?
(shows documentation on the random module)
In [5]: random.<TAB>
(shows auto-completion)
In [6]: !ls
(executes a bash command)
Programming paradigms and testing
Programming paradigms
☺☺☺☺☺ Multi paradigm (Object Oriented, Structured, Functional, etc..)
Testing support ☺☺☺☺☺ doctest, unittest, nose
Python is a multiparadigm language
Your Python programs can be a simple list of instructions (imperative approach),
or you can organize them into functions (structured programming; a functional style is also supported), or you can use objects (object oriented)
Python as a imperative language
print 'Hi, I am the psychotherapist'
print 'How do you do? What brings you here?'
response = raw_input()
print 'can you elaborate on that?'
response = raw_input()
print 'Why do you say it is ', response, '?'
....
Python as a functional language
def get_sequence(filehandler):
    '''extracts the sequence from a fasta file'''
    sequence = ''
    for line in filehandler.readlines():
        if line.startswith('>'):
            pass    # header line, not part of the sequence
        else:
            sequence += line.strip()
    return sequence

def main():
    '''execute the main functions'''
    filepath = 'samplefile.txt'
    filehandler = open(filepath, 'r')
    sequence = get_sequence(filehandler)
.....
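To see a function like get_sequence at work without a file on disk, it can be fed an in-memory file object. The sketch below is written for Python 3 and assumes a made-up FASTA record (the fasta_text content is hypothetical, standing in for 'samplefile.txt'); it skips header lines and joins the remaining lines.

```python
import io

def get_sequence(filehandler):
    """Return the concatenated sequence lines of a FASTA file, skipping headers."""
    sequence = ''
    for line in filehandler:
        if line.startswith('>'):
            continue              # header line, not part of the sequence
        sequence += line.strip()
    return sequence

# Hypothetical FASTA content, used here instead of 'samplefile.txt'
fasta_text = ">seq1 example record\nACGGCTAGGT\nCGATGCGATCG\n"
sequence = get_sequence(io.StringIO(fasta_text))
print(sequence)
```

Because the function only needs something it can iterate over line by line, the same code works with a real file opened with open().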
Object Oriented Programming explained in two sentences
When you start having complicated nested variables (like arrays of hashes of arrays of lists of .....)
→ Object Oriented programming is something you should look at
Object Oriented Programming example
genes = {
    'gene1': {
        'position': 10000,
        'chromosome': 11,
        'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
        'transcripts': {
            'transcript1': [......],
            'transcript2': [......],
        },
    },
    'gene2': {
        'position': ...........
    },
    .....
}

def get_subseq(genes, geneid, start, end):
    ''' get a subsequence of a gene, given a dictionary of gene
    annotations, a gene id, and start/end position '''
    pass
A python class
class gene:
    def __init__(self):
        self.position = None
        self.sequence = ''
        self.transcripts = []

    def get_subseq(self, start, end):
        pass
Python's syntax for classes is easy
More concise than Java, and not mandatory to use classes
OO is very complicated in Perl
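As an illustration of how a class can replace a nested dictionary of gene annotations, here is a minimal sketch in Python 3. The constructor arguments and the 0-based, end-excluded convention of get_subseq are assumptions made for this example, not something the slides specify; the sequence is only the visible part of the truncated one shown earlier.

```python
class Gene:
    """A gene annotation: position, chromosome, sequence and transcripts."""

    def __init__(self, position, chromosome, sequence):
        self.position = position
        self.chromosome = chromosome
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        """Return the subsequence from start (included) to end (excluded), 0-based."""
        return self.sequence[start:end]

gene1 = Gene(position=10000, chromosome=11, sequence='GTAGCCTGATGAACGGGCTAGCATGC')
subseq = gene1.get_subseq(0, 5)
print(subseq)
```

The data and the function that operates on it now travel together, so callers no longer need to know how the annotations are nested.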
Python and Java classes
(A Java Class)
public class Gene {
    public int position;
    public String chromosome;
    public String[] transcripts;

    public Gene(int pos) {
        position = pos;
    }

    public void getSubseq(int start, int end) {
        // not implemented
    }
}

(A Python Class)
class gene:
    def __init__(self, pos):
        self.position = pos
        self.sequence = ''
        self.transcripts = []

    def get_subseq(self, start, end):
        pass
Three ways to test a python program
When you write a program or a script and want to publish its results, you also need a way to show that it works correctly
Python has good tools for testing:
  doctest
  unittest
  nosetest
doctest
With doctest, you put examples of the usage of a function in its documentation
>>> help(say_hello)
Help on function say_hello in module __main__:
say_hello(name)
    print hello <name> to the screen
    example:
    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!
Doctest re-executes these examples, and if they don't return the expected values, a failure is reported
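The mechanism can be tried with the standard doctest module. The sketch below is written for Python 3; unlike the say_hello shown in the help output, this version returns the greeting instead of printing it (an assumption made to keep the example short), and it runs the docstring example explicitly with DocTestParser and DocTestRunner.

```python
import doctest

def say_hello(name):
    """Return the greeting 'hello <name>!!!'.

    example:
    >>> say_hello('Albert Einstein')
    'hello Albert Einstein!!!'
    """
    return 'hello ' + name + '!!!'

# Extract the examples from the docstring and re-execute them.
parser = doctest.DocTestParser()
test = parser.get_doctest(say_hello.__doc__, {'say_hello': say_hello},
                          'say_hello', None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

In a script, the usual shortcut is to call doctest.testmod(), which collects and runs the examples of every docstring in the module.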
Doctest example 2
Doctest example 3
Doctests are useful when you collaborate with non-programmers
unittest
import unittest

class SimpleFastaSeqCase(unittest.TestCase):

    # Instructions to be executed before/after all the tests
    @classmethod
    def setUpClass(cls):
        .....
    @classmethod
    def tearDownClass(cls):
        .....

    # Instructions to be executed before/after each one of the tests
    def setUp(self):
        .....
    def tearDown(self):
        .....

    # Tests
    def testCondition1(self):
        .....
    def testCondition2(self):
        .....
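A filled-in test case might look like the following sketch, written for Python 3. The gc_content function and the GcContentCase class are hypothetical, invented only to have something to test; the suite is loaded and run explicitly so that the result can be inspected.

```python
import unittest

def gc_content(sequence):
    """Hypothetical function under test: fraction of G and C in a sequence."""
    return (sequence.count('G') + sequence.count('C')) / len(sequence)

class GcContentCase(unittest.TestCase):

    def setUp(self):
        # executed before each one of the tests
        self.sequence = 'ACGT'

    def test_half_gc(self):
        self.assertEqual(gc_content(self.sequence), 0.5)

    def test_no_gc(self):
        self.assertEqual(gc_content('ATAT'), 0.0)

suite = unittest.TestLoader().loadTestsFromTestCase(GcContentCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

When the tests live in their own file, calling unittest.main() at the end of the file is the more common way to run them.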
nosetest
Nosetest scans your code and looks for all the functions with the word 'test_' in their names
def getfasta(filename):
    pass
def count_numbers(numbers, limit):
    pass
def you_like_this_talk(subliminal = True):
    pass
def test_everything_ok():    # This is a test
    pass
Message
Python is easy to learn and write
It has good tools to test and demonstrate that your programs work correctly
Python – some bioinfo use cases
Regular expressions, motif search: re, TAMO, biopython ☺☺☺☺ To use regular expressions, it is necessary to import a module. Getting help is easier
Convert a sequence file to another format: biopython ☺☺☺☺ Biopython is growing its support for bioinformatics formats
Working with genomic data: pygr ☺☺☺☺ Pygr is a great environment to work with genomic data
Query Genbank: Biopython, pygr ☺☺☺☺
Structural Bioinformatics: I don't know
Regular Expressions in Python
Using regular expressions in Python requires one more step than in Perl
You have to import a module called re first
Regular expressions are also less 'central' to the developers of the language
Example – Regular Expressions in python
>>> import re
>>> sequence = 'ACGGCTAGGTCGATGCGATCG'
>>> re.findall('A.G', sequence)
['ACG', 'AGG', 'ATG']
>>> help(re)
<get help on regular expressions>
The only advantage of Python over Perl for regular expressions is that it is easier to get help
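For motif search it is often useful to know where each match starts, not only what it is. The following Python 3 sketch reuses the sequence and pattern from the example above and collects (position, match) pairs with re.finditer; positions are 0-based and matches do not overlap.

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'
pattern = re.compile('A.G')   # an A, any single character, then a G

# finditer yields match objects, which know their position in the sequence
matches = [(m.start(), m.group()) for m in pattern.finditer(sequence)]
print(matches)
```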
Biopython
A collection of free modules for bioinformatics
Number of functionalities implemented: bioconductor > bioperl > biopython > all others
Strong points:
  File format support
  NCBI – Entrez APIs
  PDB / structures
Biopython Examples
# Parse a Fasta file into a dictionary of records
from Bio import SeqIO
seqfile = open('fastafile.fa', 'r')
sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

# Query NCBI
from Bio import Entrez
results = Entrez.esearch(db='nucleotide', term='cox2')
Entrez.read(results)
Pygr
Great for genome-wide analysis
Makes it automatic to:
  Store/retrieve data in databases or pickles
  Use and configure local blast databases
  Create annotations and store them
  Interface with NCBI, Ensembl (eq. to Ensembl Perl APIs), UCSC
Pygr examples
# Ensembl APIs
serverRegistry = get_registry(host='ensembldb.ensembl.org', user='anonymous')
coreDBAdaptor = serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')
sequence = coreDBAdaptor.fetch_slice_by_seqregion(coordSystemName, seqregionName)

# Download the sequence of the Human Genome (18)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(download=True)
TAMO and pyHMM
>>> from TAMO import MotifTools
>>> msa = ['TGACTCA',
...        'TGACTCA',
...        'TGAGTCA',
...        'TGAGTCA']
>>> m_msa = MotifTools.Motif(msa)
>>> print m_msa
TGAsTCA(4)
>>> m_msa._print_counts()
#        0       1       2       3       4       5       6
#A   0.000   0.000   4.000   0.000   0.000   0.000   4.000
#C   0.000   0.000   0.000   2.000   0.000   4.000   0.000
#T   4.000   0.000   0.000   0.000   4.000   0.000   0.000
#G   0.000   4.000   0.000   2.000   0.000   0.000   0.000
Module to work with motifs
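The count table printed by TAMO can be reproduced with the standard library alone, which helps to see what a motif object stores. This Python 3 sketch is not TAMO's implementation: it simply counts the bases in each alignment column, and the consensus_symbol helper (hypothetical) writes any column with two different bases as its IUPAC ambiguity code, without the thresholds a real motif tool would apply.

```python
from collections import Counter

msa = ['TGACTCA', 'TGACTCA', 'TGAGTCA', 'TGAGTCA']

# One Counter per alignment column: zip(*msa) transposes rows into columns
columns = [Counter(column) for column in zip(*msa)]

# IUPAC codes for two-base ambiguities (keys are alphabetically sorted pairs)
IUPAC_PAIRS = {'AG': 'R', 'CT': 'Y', 'CG': 'S', 'AT': 'W', 'GT': 'K', 'AC': 'M'}

def consensus_symbol(counts):
    """Return the base of a column, or an IUPAC code if two bases occur."""
    bases = ''.join(sorted(counts))
    if len(bases) == 1:
        return bases
    return IUPAC_PAIRS.get(bases, 'N')

consensus = ''.join(consensus_symbol(counts) for counts in columns)
print(consensus)
for base in 'ACTG':
    print(base, [counts[base] for counts in columns])
```

Column 3 contains two C and two G, which is why the consensus shows S (C or G) at that position.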
Python – bioinformatics utilities
Scientific and statistics: scipy + numpy ☺☺☺☺☺
Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺
SOAP / web scraping utilities: suds ☺☺☺
ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺
Persistent data: cPickle, shelve, ZODB ☺☺☺☺ No R-like sessions
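As a small illustration of the persistent data entry, the sketch below serializes a dictionary and restores it. It is written for Python 3, where the cPickle module of Python 2 is available simply as pickle; the genes dictionary is a made-up example.

```python
import pickle

# Made-up annotation data to store
genes = {'gene1': {'position': 10000, 'chromosome': 11}}

# Serialize to bytes; pickle.dump(obj, file) writes to a file opened in 'wb' mode instead
data = pickle.dumps(genes)
restored = pickle.loads(data)
print(restored == genes)
```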
Python and Databases
There are some good libraries for Object Relational Mapping (ORM)
ZODB: Object Oriented Database
PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data)
sqlalchemy example
Scientific Python
Numpy: Python module to work with arrays and matrices
Scipy: module to do advanced math, statistics, and more
Matplotlib: module to plot graphics
To get started with Python and plotting graphs:
$: easy_install numpy scipy matplotlib ipython
$: ipython -pylab
Numpy/Scipy example
Hint: use ipython -pylab to have an R-like environment
Is there anything I forgot?
?????
?????
?????
Thank you for the attention!
PRBB technical seminars: http://bg.imim.es/technical-seminars/
These slides will be uploaded on http://www.slideshare.net
Discarded slides
Hint: use ipython -pylab
The best way to work with python and plotting graphs is with ipython -pylab
It will give you a shell similar to matlab/octave/R/etc..
Regular expressions
To use regular expressions in python, you need to import the 're' module first
It's not as immediate as with Perl, where you can use regular expressions without importing anything
However, it is easier to get the documentation
Main python modules for bioinformatics
Biopython
Pygr
Python – storing/accessing data
Reading/Writing files: ☺☺☺☺☺
Persistent data: ☺☺☺ cPickle, shelve, ZODB
Database – Object Relational Mapping libraries: ☺☺☺☺☺ sqlalchemy, elixir
Binary formats (HDF5): ☺☺☺☺ pytables
R-like sessions: nothing to my knowledge :(