Programming in Python - Biotec · Programming in Python Michael Schroeder Sven Schreiber ......
Programming in Python
Michael Schroeder, Sven Schreiber
Updates by Andreas Henschel
Lecture 2: Sequences
Slides derived from Ian Holmes, Department of Statistics, University of Oxford
Overview
• Types of sequences and their properties
  – Lists, Tuples, Strings, Range
• Building, accessing and modifying sequences
• List comprehensions
• File operations
Types and Properties of Sequences
Lists vs tuples
• Both are sequences (used to store collections of objects)
• Tuples are immutable, lists are mutable
• Lists are more flexible
• Tuples provide better performance
• Rule of thumb: lists for objects of a similar kind, tuples for objects of different kinds
Construction (syntax)
    l = [1, 2, 3, 4]
    l2 = ['Apple', 'Banana', 'Orange']
    t = ('sebastian', 'm', 28)
    t2 = ('motif', 'ATTCG', 'E44')
Accessing elements
    l[0]    # 1
    t[0]    # 'sebastian'
Adding/modifying elements
    l.append(3)
    l[1] = 5
    t.append(3)    # not possible: tuples are immutable!
    t[1] = 5       # not possible: tuples are immutable!
Concatenating
    l3 = l + [3, 2]
    t3 = t + ('phd', 'biotec')
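To make the "immutable" remark concrete, here is a minimal sketch of what happens when the list and tuple operations from this slide are actually run. The `errors` list is a helper introduced only for this illustration.

```python
l = [1, 2, 3, 4]
t = ('sebastian', 'm', 28)

# Lists can be changed in place.
l.append(3)
l[1] = 5

# Tuples have no append method and reject item assignment.
errors = []  # hypothetical helper: collects the names of the raised exceptions
try:
    t.append(3)
except AttributeError as exc:
    errors.append(type(exc).__name__)
try:
    t[1] = 5
except TypeError as exc:
    errors.append(type(exc).__name__)

# Concatenation builds a new tuple and leaves t unchanged.
t3 = t + ('phd', 'biotec')

print(l)       # [1, 5, 3, 4, 3]
print(errors)  # ['AttributeError', 'TypeError']
print(t3)      # ('sebastian', 'm', 28, 'phd', 'biotec')
```

Concatenation works for tuples because it never modifies `t`; it only creates a new object.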
Range
• Provides a sequence of consecutive integers
• Allows iteration with loops
• Numbers are not stored in memory, but generated when needed (while looping)
• Saves time and memory for large sets of numbers
    for x in range(10000):
        print(x)
Output:
    0
    1
    2
    3
    ...
    9998
    9999
The end value (10000) is excluded!
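The memory claim can be checked roughly with `sys.getsizeof`, which reports the size of the container object itself. This is a sketch: the exact byte counts are specific to CPython and the platform, so only the comparison matters.

```python
import sys

r = range(10000)
as_list = list(r)  # forces all 10000 numbers into memory

# Size of the range object versus the list holding the same numbers.
range_size = sys.getsizeof(r)
list_size = sys.getsizeof(as_list)
print(range_size, list_size)

# A range still behaves like a sequence: it has a length and can be indexed.
print(len(r), r[0], r[-1])  # 10000 0 9999
```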
Working with Lists
Lists
A list is a collection of values/objects
    nucleotides = ['a', 'c', 'g', 't']
    print("Nucleotides:", nucleotides)
Output:
    Nucleotides: ['a', 'c', 'g', 't']
We can think of the above as a container with 4 entries:
    element 0: a
    element 1: c
    element 2: g
    element 3: t
The list is the collection of all four elements.
Note that the element indices start at zero!
List literals
• There are several ways to create or obtain lists.
    a = [1, 2, 3, 4, 5]
    print("a =", a)
    b = ['a', 'c', 'g', 't']
    print("b =", b)
    c = list(range(1, 6))
    print("c =", c)
    d = "a c g t".split()
    print("d =", d)
Output:
    a = [1, 2, 3, 4, 5]
    b = ['a', 'c', 'g', 't']
    c = [1, 2, 3, 4, 5]
    d = ['a', 'c', 'g', 't']
The literal form used for a and b is the most common: a comma-separated list, delimited by square brackets.
Accessing lists
To access list elements, use square brackets, e.g. x[0] means "element zero of list x"
• Remember, element indices start at zero!
• Negative indices count from the end, e.g. x[-1] means "last element of list x"
    x = ['a', 'c', 'g', 't']
    i = 2
    print(x[0], x[i], x[-1])
Output:
    a g t
List operations
• You can sort and reverse lists...
    x = ['a', 't', 'g', 'c']
    print("x =", x)
    x.sort()
    print("x =", x)
    x.reverse()
    print("x =", x)
Output:
    x = ['a', 't', 'g', 'c']
    x = ['a', 'c', 'g', 't']
    x = ['t', 'g', 'c', 'a']
• You can add, delete and count elements
    nums = [2, 2, 5, 2, 6]
    nums.append(8)
    print(nums)
    print(nums.count(2))
    nums.remove(5)
    print(nums)
Output:
    [2, 2, 5, 2, 6, 8]
    3
    [2, 2, 2, 6, 8]
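One detail worth seeing in action is that `sort()` and `reverse()` modify the list in place and return `None`. The sketch below contrasts this with the built-in `sorted()` function and a reversing slice, neither of which appears on the slide; both return new lists instead.

```python
x = ['a', 't', 'g', 'c']

# sort() changes the list in place and returns None.
result = x.sort()
print(result)  # None
print(x)       # ['a', 'c', 'g', 't']

# sorted() and a reversing slice return new lists and leave y untouched.
y = ['a', 't', 'g', 'c']
y_sorted = sorted(y)
y_reversed = y[::-1]
print(y)           # ['a', 't', 'g', 'c']
print(y_sorted)    # ['a', 'c', 'g', 't']
print(y_reversed)  # ['c', 'g', 't', 'a']

# remove() deletes only the first matching element.
nums = [2, 2, 5, 2, 6]
nums.remove(2)
print(nums)  # [2, 5, 2, 6]
```

A common slip is writing `x = x.sort()`, which replaces the list with `None`.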
More list operations
    >>> x = [1, 0] * 2      # multiplying lists
    >>> x
    [1, 0, 1, 0]
    >>> x.pop()             # pop() obtains and removes the last element of a list
    0
    >>> x
    [1, 0, 1]
    >>> x += x              # concatenating lists with + or +=
    >>> x
    [1, 0, 1, 1, 0, 1]
    >>> x.index(0)          # index(..) searches for the first occurrence of an element
    1
Example: Reverse complementing DNA
A common operation due to the double-helix symmetry of DNA
    # Start by making the string lower case. This is generally good practice
    dna = "accACgttAGgtct".lower()
    # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
    replaced = dna.replace("a", "_a") \
        .replace("t", "a").replace("_a", "t") \
        .replace("g", "_g").replace("c", "g") \
        .replace("_g", "c")
    # Convert to list and reverse
    replacedList = list(replaced)
    replacedList.reverse()
    # Convert back to string using join
    print("".join(replacedList))
Output:
    agacctaacgtggt
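As an alternative to the chained `replace` calls, the same result can be obtained with `str.maketrans` and `str.translate`, which swap all bases in one pass. This is a different technique from the one on the slide, shown here only as a sketch; the function name `reverse_complement` is chosen for illustration.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string, in lower case."""
    # maketrans builds a character-to-character mapping, so every base is
    # swapped at once and no "_a"/"_g" placeholders are needed.
    complement = str.maketrans("acgt", "tgca")
    # [::-1] reverses the complemented string.
    return dna.lower().translate(complement)[::-1]

print(reverse_complement("accACgttAGgtct"))  # agacctaacgtggt
```

Applying the function twice returns the original (lower-cased) sequence, which makes for a quick sanity check.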
Taking a slice of a list
• The syntax x[i:j] returns a list containing elements i, i+1, …, j-1 of list x
    nucleotides = ['a', 'g', 'c', 't']
    print(nucleotides)
    print(nucleotides[0:2])   # nucleotides[:2] also works
    print(nucleotides[2:4])   # nucleotides[2:] also works
    print(nucleotides[-2:])   # takes last two elements
    print(nucleotides[::2])   # takes every second element
    print(nucleotides[::-1])  # obtains reversed list
Output:
    ['a', 'g', 'c', 't']
    ['a', 'g']
    ['c', 't']
    ['c', 't']
    ['a', 'c']
    ['t', 'c', 'g', 'a']
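Slicing works the same way on strings, which is handy for sequence data. The following sketch uses a made-up 9-base sequence to cut a string into codons and shows two properties of slices: out-of-range bounds are clipped, and a slice is a copy.

```python
dna = "atggcgtaa"  # hypothetical example sequence

# Slice bounds past the end are clipped instead of raising an error.
print(dna[6:100])  # taa

# Step through the string three bases at a time.
codons = []
for i in range(0, len(dna), 3):
    codons.append(dna[i:i + 3])
print(codons)  # ['atg', 'gcg', 'taa']

# A slice is a new list: changing it leaves the original alone.
nucleotides = ['a', 'g', 'c', 't']
first_two = nucleotides[:2]
first_two[0] = 'x'
print(first_two)    # ['x', 'g']
print(nucleotides)  # ['a', 'g', 'c', 't']
```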
Lists and Strings
• A string can be split into a list of strings
  – Using the split method: string.split(separator)
• A list of strings can be joined into one string
  – Using the join method: separator.join(list)
    sentence = 'This is a complete sentence.'
    print(sentence.split())
Output:
    ['This', 'is', 'a', 'complete', 'sentence.']
    datarow = 'Apples,Bananas,Oranges'
    print(datarow.split(','))
Output:
    ['Apples', 'Bananas', 'Oranges']
    cities = ['Dresden', 'Munich', 'Hamburg', 'Cologne']
    print(' -> '.join(cities))
Output:
    Dresden -> Munich -> Hamburg -> Cologne
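A short sketch of how `split` and `join` complement each other, plus one restriction that often surprises: `join` only accepts strings, so numbers have to be converted first. The `counts` data is invented for the example.

```python
datarow = 'Apples,Bananas,Oranges'
fields = datarow.split(',')

# Joining with the same separator restores the original string.
restored = ','.join(fields)
print(restored == datarow)  # True

# join() only accepts strings, so convert numbers before joining.
counts = [3, 12, 7]  # hypothetical data
as_text = []
for c in counts:
    as_text.append(str(c))
line = ';'.join(as_text)
print(line)  # 3;12;7

# split() without an argument splits on any run of whitespace.
print('a  c\tg \n t'.split())  # ['a', 'c', 'g', 't']
```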
List Comprehensions
What are list comprehensions?
• Very concise way to build and transform lists
• Typically replaces a for loop and an if construct
• Used very often in Python
• Syntax: [expr(var) for var in sequence if condition]
Verbose construction of a list:
    newlist = []
    for x in range(1, 11):
        if x % 2:
            newlist.append(x**2)
Construction with a list comprehension:
    newlist = [x**2 for x in range(1, 11) if x % 2]
Result (squares of all odd numbers between 1 and 10):
    [1, 9, 25, 49, 81]
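The same pattern applies directly to sequence data. As a sketch, the comprehensions below filter and transform the lower-cased DNA string from the reverse-complement example, and the explicit loop confirms that both forms build the same list.

```python
dna = "accacgttaggtct"

# Filter only: positions at which a 'g' occurs.
g_positions = [i for i in range(len(dna)) if dna[i] == 'g']
print(g_positions)  # [5, 9, 10]

# Transform and filter: upper-case every base that is 'g' or 'c'.
gc_bases = [base.upper() for base in dna if base in 'gc']
gc_fraction = len(gc_bases) / len(dna)
print(gc_fraction)  # 0.5

# The same filter written as an explicit loop gives the same list.
loop_positions = []
for i in range(len(dna)):
    if dna[i] == 'g':
        loop_positions.append(i)
print(loop_positions == g_positions)  # True
```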
Examples: List comprehensions
    sentence = 'I like MySQL but not Python'
    print([(w.lower(), len(w)) for w in sentence.split()])
Output:
    [('i', 1), ('like', 4), ('mysql', 5), ('but', 3), ('not', 3), ('python', 6)]
Sum up all positive integers in a tuple:
    numbers = (1, 0, -1, 6, 3, -2, 3, 4)
    # named total so that the built-in sum() is not overwritten
    total = sum([x for x in numbers if x > 0])
    print(total)
Output:
    17
File IO
Opening and reading a file
    f = open('myfile.txt', 'r')    # open returns a file handle; 'r' is the file mode (r, w, a, ...)
    for line in f:                 # line is the loop variable; line-wise iteration over the file!
        if not line.startswith('#'):
            print(line.rstrip())   # rstrip() drops the newline that each line still carries
    f.close()
Contents of myfile.txt:
    #Old number
    1234
    # New number
    5555
    # Test
    1
Output:
    1234
    5555
    1
Shorter and better form (the file is closed after the block!):
    with open('myfile.txt', 'r') as f:
        for line in f:
            if not line.startswith('#'):
                print(line.rstrip())
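The other file modes can be tried without touching real data by working in a temporary directory. The following is a minimal sketch: the use of `tempfile` and the appended lines are assumptions made for the illustration, while the initial file contents mirror the slide.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'myfile.txt')  # hypothetical location

    # Mode 'w' creates the file (or overwrites an existing one).
    with open(path, 'w') as f:
        f.write('#Old number\n1234\n# New number\n5555\n# Test\n1\n')

    # Mode 'a' appends to the end of an existing file.
    with open(path, 'a') as f:
        f.write('# Appended\n42\n')

    # Mode 'r' reads; skip comment lines and convert the rest to integers.
    numbers = []
    with open(path, 'r') as f:
        for line in f:
            if not line.startswith('#'):
                numbers.append(int(line.rstrip()))

print(numbers)   # [1234, 5555, 1, 42]
print(f.closed)  # True
```

The final line shows that the `with` block has already closed the file, so no explicit `f.close()` is needed.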
Example: FASTA format
• A format for storing multiple named sequences
• This file (fly3utr.txt) contains 3' UTRs for Drosophila genes
• The name of each sequence is preceded by a > symbol (here: CG11604, CG11455, CG11488)
• NB sequences can span multiple lines
fly3utr.txt:
    >CG11604
    TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
    >CG11455
    TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
    >CG11488
    TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
Example: FASTA format
    with open('fly3utr.txt', 'r') as f:
        for line in f:
            if line.startswith('>'):
                print(line[1:].rstrip())   # drop the '>' and the trailing newline
Output:
    CG11604
    CG11455
    CG11488
What if we want to show the length of each sequence record?
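Example: FASTA format
    name = None
    length = None
    with open('fly3utr.txt', 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                # name is None (treated as False) before the first record,
                # so nothing is printed yet
                if name:
                    print(name, length)
                name = line[1:]
                length = 0
            else:
                length += len(line)
    # print the last record, which is not followed by another '>' line
    print(name, length)
Output:
    CG11604 58
    CG11455 83
    CG11488 69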
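The length-counting loop can also be wrapped in a function that collects the results as a list of (name, length) tuples instead of printing them. This is only a sketch: the function name `fasta_lengths` and the miniature FASTA text are made up for the example, and `io.StringIO` stands in for a file opened with `open(...)`.

```python
import io

# Hypothetical miniature FASTA data; the first sequence spans two lines.
fasta_text = """>seq1
ACGT
AC
>seq2
TTT
"""

def fasta_lengths(handle):
    """Return a list of (name, length) tuples for a FASTA file handle."""
    records = []
    name = None
    length = 0
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, length))
            name = line[1:]
            length = 0
        else:
            length += len(line)
    # The last record is not followed by another header line.
    if name is not None:
        records.append((name, length))
    return records

for name, length in fasta_lengths(io.StringIO(fasta_text)):
    print(name, length)
# prints:
# seq1 6
# seq2 3
```

Returning the tuples rather than printing them keeps the parsing step reusable, for example for sorting the records by length afterwards.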
Summary
• Strings, lists, tuples and ranges are all sequences
• Lists (usually for elements of the same type)
  – More flexible, more memory consumption
• Tuples (usually store elements of different types)
  – Immutable, less memory consumption
• Ranges for fast numeric iteration
  – Least memory consumption
• List comprehensions are a concise way to transform sequences
• Convert strings into lists and vice versa with split and join
• File handles provide line-wise iteration