Parsing data records
![Page 1: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/1.jpg)
Parsing data records
![Page 2: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/2.jpg)
![Page 3: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/3.jpg)
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
A sequence record in FASTA format
![Page 4: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/4.jpg)
seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"
for i in seq:
    print i
![Page 5: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/5.jpg)
seq = open("SingleSeq.fasta")
for line in seq:
    print line
![Page 6: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/6.jpg)
seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")
for line in seq:
    seq_2.write(line)
seq_2.close()
![Page 7: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/7.jpg)
Writing different things depending on a condition
Read a sequence in FASTA format and print only the header of the sequence
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
![Page 8: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/8.jpg)
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line
![Page 9: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/9.jpg)
Making choices: The if/elif/else statements
if <condition 1>:          # if the expression in <condition 1> is TRUE
    <statements 1>         #     execute statements 1
[elif <condition 2>:       # else, if the expression in <condition 2> is TRUE
    <statements 2>]        #     execute statements 2
[elif <condition 3>:       # etc.
    pass]
...
[else:
    <statements N>]
![Page 10: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/10.jpg)
>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667
Write different things depending on a condition
![Page 11: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/11.jpg)
>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667
>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly
Write different things depending on a condition
![Page 12: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/12.jpg)
The if/elif/else construct produces different effects compared with the use of a series of if conditions
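A minimal sketch of that difference, written in Python 3 syntax (the slides themselves use Python 2). The function names `classify_elif` and `classify_ifs` and the length thresholds are hypothetical, chosen only for illustration. In an if/elif/else chain at most one branch runs; in a series of independent if statements every condition is tested and every true one runs its block.

```python
def classify_elif(length):
    """Collect the labels chosen by an if/elif/else chain."""
    labels = []
    if length > 100:
        labels.append("longer than 100")
    elif length > 50:
        labels.append("longer than 50")
    else:
        labels.append("50 or shorter")
    return labels


def classify_ifs(length):
    """Collect the labels chosen by a series of independent ifs."""
    labels = []
    if length > 100:
        labels.append("longer than 100")
    if length > 50:
        labels.append("longer than 50")
    if length <= 50:
        labels.append("50 or shorter")
    return labels


print(classify_elif(246))  # only the first true branch runs
print(classify_ifs(246))   # every true condition runs its block
```

For a sequence of length 246 the chain reports a single label, while the series of ifs reports two, because 246 is both greater than 100 and greater than 50.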
![Page 13: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/13.jpg)
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line
![Page 14: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/14.jpg)
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line
Comparison operators: ==  !=  >=  <=  >  <
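As a quick illustration, each comparison operator returns True or False. Note that Python spells "greater than or equal to" as `>=`. The sketch below uses Python 3 syntax, and the ten-residue string is just the start of the example record, used here as toy data.

```python
seq = "MTMDKSELVQ"   # first ten residues of the example record
n = len(seq)         # 10

results = [
    ("==", n == 10),   # equal to
    ("!=", n != 10),   # not equal to
    (">=", n >= 10),   # greater than or equal to
    ("<=", n <= 9),    # less than or equal to
    (">", n > 10),     # strictly greater than
    ("<", n < 11),     # strictly less than
]
for operator, value in results:
    print(operator, value)
```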
![Page 15: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/15.jpg)
Exercises 1, 2, and 3
1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to one file and the sequence to a different one.
![Page 16: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/16.jpg)
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
header.close()
![Page 17: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/17.jpg)
fasta = open('SingleSeq.fasta')
seq = open('seq.txt','w')
for line in fasta:
    if line[0] != '>':
        seq.write(line)
seq.close()
![Page 18: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/18.jpg)
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt','w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)
header.close()
seq.close()
![Page 19: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/19.jpg)
Let’s increase the difficulty just a bit…
![Page 20: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/20.jpg)
seq_fasta = open("SingleSeq.fasta")
seq = ''
for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()
num_cys = seq.count("C")
print header, seq, num_cys
![Page 21: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/21.jpg)
Exercise 4
4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.
![Page 22: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/22.jpg)
seq_fasta = open("SingleSeq.fasta")
seq = ''
header = ''
for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:
            seq = seq + line
if header:
    print header + seq
else:
    print "The record is not from H. sapiens"
![Page 23: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/23.jpg)
In general, you will need to analyse several sequences….
![Page 24: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/24.jpg)
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........
SwissProt-Human.fasta
![Page 25: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/25.jpg)
Read the records from a file and write them to a new file
fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
for line in fasta:
    fasta_2.write(line)
fasta_2.close()
The argument passed to write() must be a string
![Page 26: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/26.jpg)
Strings can be concatenated
Strings can be indexed and sliced
String elements cannot be re-assigned
>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT
>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'
>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
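Because string elements cannot be re-assigned, a "changed" string has to be built as a new object. The sketch below (Python 3 syntax, whereas the slides use Python 2) shows two common ways to do that; the variable names `mutated` and `rna` are illustrative only.

```python
s = "ACTGGTA"

# Slicing plus concatenation builds a NEW string; s itself is unchanged.
mutated = s[:2] + "Z" + s[3:]

# str.replace() also returns a new string (here every T becomes U).
rna = s.replace("T", "U")

print(s)
print(mutated)
print(rna)
```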
![Page 27: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/27.jpg)
Read the sequences from a file and write them to a new file
fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
n = 0
for line in fasta:
    n = n + 1
    l_n = str(n)
    fasta_2.write(l_n + "\t" + line)
fasta_2.close()
Number the lines starting from 1
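The same numbering can be sketched with the built-in `enumerate()`, which hands out the counter together with each line. This is an alternative to the manual counter, not the slides' own solution. It is written in Python 3 syntax, and `io.StringIO` objects stand in for the real input and output files so the example is self-contained; the three-line input is toy data.

```python
import io

# In-memory stand-ins for the input and output files.
fasta = io.StringIO(">sp|P31946|1433B_HUMAN\nMTMDKSELVQ\nKAKLAEQAER\n")
numbered = io.StringIO()

# enumerate(..., 1) starts counting at 1 instead of the default 0.
for n, line in enumerate(fasta, 1):
    numbered.write(str(n) + "\t" + line)   # write() still needs a string

result = numbered.getvalue()
print(result)
```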
![Page 28: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/28.jpg)
![Page 29: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/29.jpg)
Exercise 5
5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.
![Page 30: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/30.jpg)
fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')
for line in fasta:
    if line[0] == '>':
        headers.write(line)
headers.close()
![Page 31: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/31.jpg)
![Page 32: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/32.jpg)
Exercise 6
6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line
![Page 33: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/33.jpg)
fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')
for line in fasta:
    if line[0] == '>':
        seqs.write('\n')
    elif line[0] != '>':
        seqs.write(line)
seqs.close()
Variant of the write in the elif branch: seqs.write(line.strip() + '\n')
![Page 34: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/34.jpg)
Exercise 7
7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.
![Page 35: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/35.jpg)
fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''
if "Homo sapiens" in header:
    output.write(header + seq)
output.close()
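The record-by-record logic above can also be wrapped in a reusable function. The following is a sketch under stated assumptions, not the slides' solution: it uses Python 3 syntax, the helper name `read_fasta` is hypothetical, an `io.StringIO` object stands in for the FASTA file, and the second record (accession X00000) is invented for the demonstration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        # The last record has no following '>' line to trigger the yield.
        yield header, "".join(chunks)


# Toy input: the second record is invented for the demo.
fasta = io.StringIO(
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n"
    "MTMDKSELVQKAKLAEQAER\n"
    "YDDMAAAMKA\n"
    ">sp|X00000|DEMO_MOUSE invented record OS=Mus musculus\n"
    "MAAAC\n"
)

human_records = []
for header, seq in read_fasta(fasta):
    if "Homo sapiens" in header:
        human_records.append((header, seq))

print(human_records)
```

Keeping the parsing in one function means the filter (here the species test) can change from exercise to exercise without rewriting the loop that tracks headers and sequences.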
![Page 36: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/36.jpg)
Exercise 8
8) Read FASTA records from a file and count the cysteine residues in each sequence.
![Page 37: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/37.jpg)
fasta = open('sprot_prot.fasta')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''
cys_num = seq.count('C')    # count for the last record
print header, ': ', cys_num
Read the records from a file and count the cysteine residues in each sequence
![Page 38: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/38.jpg)
Exercises 9, 10, and 11
9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine ('M') and have at least two tryptophans ('W').
![Page 39: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/39.jpg)
outfile = open('SwissProtHuman-Filtered.fasta','w')
fasta = open('SwissProtHuman.fasta','r')
seq = ''
for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line
TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)
outfile.close()
![Page 40: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/40.jpg)
In many cases you will need to compare data from different files
![Page 41: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/41.jpg)
SwissProt-Human.fasta
cancer-expressed.txt
![Page 42: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/42.jpg)
![Page 43: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/43.jpg)
1) Read 10 SwissProt ACs from a file
2) Store them in a data structure
cancer_file = open('cancer-expressed.txt')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
print cancer_list
![Page 44: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/44.jpg)
List data structure
A list is a mutable ordered collection of objects
L = [1, [2,3], 4.52, 'DNA']
The elements of a list can be any kind of object: numbers, strings, tuples, lists, dictionaries, function calls, etc.
L = []    # the empty list
![Page 45: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/45.jpg)
![Page 46: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/46.jpg)
>>> L = [1, "hello", 12.1, [1, 2, "three"], "seq", (1, 2)]
>>> L[0]        # indexing
1
>>> L[3]        # indexing
[1, 2, 'three']
>>> L[3][2]     # indexing
'three'
>>> L[-1]       # negative indexing
(1, 2)
>>> L[2:4]      # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]       # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]
![Page 47: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/47.jpg)
The elements of a list can be changed/replaced after the list has been defined
l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)
>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']
These operations CHANGE the list
![Page 48: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/48.jpg)
l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)
>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']
The elements of a list can be changed/replaced after the list has been defined
![Page 49: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/49.jpg)
l.reverse()
l.sort()
sorted(l)
>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
The elements of a list can be changed/replaced after the list has been defined
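One practical consequence of the difference between `sorted(l)` and `l.sort()` is worth a small sketch (Python 3 syntax; the slides use Python 2, where the behaviour is the same). `sorted()` returns a new list, while `sort()` changes the list in place and returns `None`, so assigning its result is a common slip. The three accession numbers are taken from the example records; the variable names are illustrative.

```python
acs = ["P62258", "Q04917", "P31946"]

ordered = sorted(acs)    # new sorted list; acs is left untouched
snapshot = list(acs)     # copy of acs before sorting in place

returned = acs.sort()    # sorts acs in place and returns None

print(ordered)
print(snapshot)
print(returned)
print(acs)
```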
![Page 50: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/50.jpg)
Putting together lists and loops
range() and xrange() built-in functions
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
# the xrange() function is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4
The xrange() function generates the values one at a time as the loop asks for them, i.e. it does not build and store the whole list in memory
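A note for readers running Python 3 rather than the Python 2 used in these slides: `xrange()` no longer exists there, and `range()` itself produces its values on demand. The short sketch below shows the Python 3 behaviour; the variable names are illustrative.

```python
r = range(0, 30, 5)      # a lazy range object, not a list
as_list = list(r)        # build the actual list only when it is needed

squares = []
for i in range(5):       # looping never materialises the full list
    squares.append(i * i)

print(as_list)
print(squares)
```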
![Page 51: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/51.jpg)
Exercise 12
12) Create a list containing Uniprot ACs extracted from a FASTA file. Print the list.
![Page 52: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/52.jpg)
InputFile = open("SwissProtHuman.fasta","r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')
        AC_list.append(fields[1])
print AC_list
![Page 53: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/53.jpg)
By the way…. Exercise 13
13) Read a file in FASTA format and copy to a new file the record ACs.
![Page 54: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/54.jpg)
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')
Outfile.close()
Selectively extract ACs from a FASTA file
![Page 55: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/55.jpg)
Exercise 14
14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.
![Page 56: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/56.jpg)
Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta', 'w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()
We are not writing the whole record but the header line only
![Page 57: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/57.jpg)
![Page 58: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/58.jpg)
Exercise 15
15) Read a multiple sequence file in FASTA format and write to a new file only the records whose UniProt ACs are present in the list created in 12).
![Page 59: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/59.jpg)
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()
![Page 60: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/60.jpg)
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')
cancer_list = []
seq = ''
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>' and seq == '':
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if AC in cancer_list:
            Outfile.write(header+seq)
        header = line
        AC = line.split('|')[1]
        seq = ''
if AC in cancer_list:
    Outfile.write(header+seq)
Outfile.close()
The same but with more control…
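A possible variation, not part of the slides: the ACs can be stored in a set instead of a list. A membership test on a list scans the elements one by one, while a set looks the value up directly, which tends to matter once the AC file grows beyond a handful of entries. The sketch uses Python 3 syntax; `io.StringIO` objects stand in for the two files, the sequences are truncated to ten residues, and the choice of ACs in the "cancer" list is invented, since the slides do not show the content of cancer-expressed.txt.

```python
import io

# Stand-ins for cancer-expressed.txt and SwissProt-Human.fasta (toy content).
cancer_file = io.StringIO("P31946\nQ04917\n")
human_fasta = io.StringIO(
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n"
    "MTMDKSELVQ\n"
    ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens\n"
    "MDDREDLVYQ\n"
    ">sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH\n"
    "MGDREQLLQR\n"
)

cancer_set = set()
for line in cancer_file:
    cancer_set.add(line.strip())

selected = []
keep = False
for line in human_fasta:
    if line[0] == ">":
        # Decide once per record; the sequence lines inherit the decision.
        keep = line.split("|")[1] in cancer_set
    if keep:
        selected.append(line)

output = "".join(selected)
print(output)
```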
![Page 61: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/61.jpg)
![Page 62: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/62.jpg)
Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk)
Try to write it in FASTA format:
>AP006852
CcactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaaagtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatccatctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaacacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaaGtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga......
![Page 63: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/63.jpg)
Exercise 16
16) Read a Genbank record and write to a file the nucleotide sequence in FASTA format.
![Page 64: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/64.jpg)
InputFile = open("ap006852.gbk")
OutputFile = open("ap006852.fasta","w")
flag = 0
for line in InputFile:
    if line[0:9] == 'ACCESSION':
        AC = line.split()[1].strip()
        OutputFile.write('>'+AC+'\n')
    if line[0:6] == 'ORIGIN':
        flag = 1
        continue
    if flag == 1:
        fields = line.split()
        if fields != []:
            seq = ''.join(fields[1:])
            OutputFile.write(seq +'\n')
InputFile.close()
OutputFile.close()
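The same conversion can be sketched as a function that is easy to try on a small input. This is an illustrative variant in Python 3 syntax, not the slides' solution: the name `genbank_to_fasta` is hypothetical, an `io.StringIO` object stands in for the file, and the record is a heavily abridged toy, not the real content of ap006852.gbk. Unlike the script above, the sketch stops at the `//` line that terminates a GenBank record, so no empty line is written for it.

```python
import io


def genbank_to_fasta(lines):
    """Return FASTA text for the first record found in GenBank lines."""
    out = []
    in_origin = False
    for line in lines:
        if line.startswith("ACCESSION"):
            out.append(">" + line.split()[1])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # '//' marks the end of a GenBank record
        elif in_origin:
            fields = line.split()
            if fields:
                # fields[0] is the base counter, the rest are sequence blocks
                out.append("".join(fields[1:]))
    return "\n".join(out) + "\n"


# Abridged toy record used only to exercise the function.
record = io.StringIO(
    "LOCUS       AP006852\n"
    "ACCESSION   AP006852\n"
    "ORIGIN\n"
    "        1 ccactgtcca atggctcaac\n"
    "       21 acgccaatca\n"
    "//\n"
)

fasta_text = genbank_to_fasta(record)
print(fasta_text)
```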
![Page 65: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/65.jpg)
Parsing data records
• Start by visually inspecting the file you want to parse
• Identify the information you want to extract
• Identify separators to select your information using if conditions
• Use lists if you have to compare data from different files
![Page 66: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/66.jpg)
cancer_file = open('cancer-expressed.txt')
cancer_list = []
line = cancer_file.readline()
while line:
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()
We can use while loops to read files (but usually we won't do it)
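For comparison, here is the for-loop form of the same reading task combined with the `with` statement, which closes the file automatically when the block ends. This is a sketch in Python 3 syntax (the `with` statement is also available in Python 2.6 and later). So that it runs on its own, it first writes a small temporary file; its two ACs are toy content, not the real cancer-expressed.txt.

```python
import os
import tempfile

# Create a small stand-in for cancer-expressed.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cancer-expressed.txt")
with open(path, "w") as handle:
    handle.write("P31946\nQ04917\n")

cancer_list = []
with open(path) as cancer_file:
    for line in cancer_file:          # the for loop reads line by line
        cancer_list.append(line.strip())

closed_after = cancer_file.closed     # True: the file was closed on exit
print(cancer_list, closed_after)
```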
![Page 67: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/67.jpg)
You can repeat all exercises using ncbi_gene.fasta as input file
![Page 68: Parsing data records](https://reader035.fdocuments.in/reader035/viewer/2022062321/5681653b550346895dd7bd46/html5/thumbnails/68.jpg)
Summary
• Parsing sequence records in FASTA format
• Lists
• Making choices: if/elif/else
• range() and xrange()