P4 2017 io

Post on 23-Jan-2018


FBW

07-11-2017

Wim Van Criekinge

BPC 2017

Recap

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break

continue

Strings

REGULAR EXPRESSIONS

Devhints.io

Towards a protein prosite scanner

N-{P}-[ST]-{P}.

[RK](2)-x-[ST].

[ST]-x-[RK].

[ST]-x(2)-[DE].

[RK]-x(2,3)-[DE]-x(2,3)-Y.

G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}.

x-G-[RK]-[RK].

C-x-[DN]-x(4)-[FY]-x-C-x-C.

E-x(2)-[ERK]-E-x-C-x(6)-[EDR]-x(10,11)-[FYA]-[YW].

[DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-[LIVGSTACR]-{LPIY}-{VY}-[LIVMFA].

[KRHQSA]-[DENQ]-E-L>.

R-G-D.

[AG]-x(4)-G-K-[ST].

D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].

[EQ]-{LNYH}-x-[ATV]-[FY]-{LDAM}-{T}-W-{PG}-N.

[LIVM]-x-[SGNL]-[LIVMN]-[DAGHENRS]-[SAGPNVT]-x-[DNEAG]-[LIVM]-x-[DEAGQ]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-[WS]-x(0,1)-[LIVM](2).

[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C.

C-x-C-x(2)-{V}-x(2)-G-{C}-x-C.

C-x(2)-P-F-x-[FYWIV]-x(7)-C-x(8,10)-W-C-x(4)-[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C.

[LIFAT]-{IL}-x(2)-W-x(2,3)-[PE]-x-{VF}-[LIVMFY]-[DENQS]-[STA]-[AV]-[LIVMFY].

[KRH]-x(2)-C-x-[FYPSTV]-x(3,4)-[ST]-x(3)-C-x(4)-C-C-[FYWH].

C-x(4,5)-C-C-S-x(2)-G-x-C-G-x(3,4)-[FYW]-C.

[LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x-[LIVM]-{Y}-x(2)-{L}-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)-[RKNAIMW].

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.

L-x(6)-L-x(6)-L-x(6)-L.

C-x(2)-C-x(1,2)-[DENAVSPHKQT]-x(5,6)-[HNY]-[FY]-x(4)-C-x(2)-C-x(2)-F(2)-x-R.

[LIVMFE]-[FY]-P-W-M-[KRQTA].

L-M-A-[EQ]-G-L-Y-N.

IRED_1R-P-C-x(11)-C-V-S.

[RKQ]-R-[LIM]-x-[LF]-G-[LIVMFY]-x-Q-x-[DNQ]-V-G.

[KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}-[RKTAENQ]-x-R-{S}-[RK].

[LIVMF](2)-D-E-A-D-[RKEN]-x-[LIVMFYGSTN].

[KRQ]-[LIVMA]-x(2)-[GSTALIV]-{FYWPGDN}-x(2)-[LIVMSA]-x(4,9)-[LIVMF]-x-{PLH}-[LIVMSTA]-[GSTACIL]-{GPK}-{F}-x-[GANQRF]-[LIVMFY]-x(4,5)-[LFY]-x(3)-[FYIVA]-{FYWHCM}-{PGVI}-x(2)-[GSADENQKR]-x-[NSTAPKL]-[PARL].

Scan for the following prosite patterns in your 4 sequences

Hint: translate the patterns to regexes and then scan

reuse galacto.py in github

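import re

# `sequences` is assumed to be a list holding the four sequences (SEQ1 to SEQ4) as strings
Consensus_pattern = "G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y"

# Translate the PROSITE pattern to a regex: drop the dashes, let x match any residue
pattern = Consensus_pattern.replace("-", "")
pattern = pattern.replace("x", "[A-Z]")
#print(pattern)

count = 0
for s in sequences:
    count = count + 1
    print("searching seq", count)
    s = s.replace(" ", "")  # remove the spaces between residue blocks
    matches = re.finditer(pattern, s)
    for match in matches:
        print(match.group(0), "from: ", match.start(), "to: ", match.end())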
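The two replace() calls are enough for this one consensus pattern, but most patterns in the list also use repeats such as x(2,3), exclusions such as {P} and the anchors < and >. A minimal sketch of a more general translation follows. The function name prosite_to_regex is made up for this example, and it assumes the anchors only appear at the very start or end of a pattern, not inside brackets.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regex (sketch, see assumptions above)."""
    pattern = pattern.strip().rstrip(".")          # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")   # < pins the match to the N-terminus
        anchored_end = element.endswith(">")       # > pins the match to the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,3)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                             # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"         # {..}: any residue except these
        # [..] and single letters are already valid regex
        if repeat:
            core += "{" + repeat + "}"
        if anchored_start:
            core = "^" + core
        if anchored_end:
            core += "$"
        parts.append(core)
    return "".join(parts)

print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y."))
print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>."))
```

The returned string can be passed straight to re.finditer() in the scanning loop.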

>SEQ1

MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR

>SEQ2

MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ

>SEQ3

MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL

>SEQ4

MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA

sequences


Reading Files

name = open("filename")

– opens the given file for reading, and returns a file object

name.read() - file's entire contents as a string

name.readline() - next line from file as a string

name.readlines() - file's contents as a list of lines

– the lines from a file object can also be read using a for loop

>>> f = open("hours.txt")

>>> f.read()

'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'


File Input Template

• A template for reading files in Python:

name = open("filename")
for line in name:
    statements

>>> infile = open("hours.txt")   # avoid the name input, it shadows the built-in
>>> for line in infile:
...     print(line.strip())      # strip() removes \n
...
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
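The same for-loop template can read the four sequences from a FASTA file instead of pasting them into the script. A minimal sketch: read_fasta is a made-up helper name, and io.StringIO stands in for a real file object so the example runs without a file on disk.

```python
import io

def read_fasta(handle):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:]               # header line starts a new record
            records[name] = ""
        elif name is not None:
            # sequence line: remove the spaces between residue blocks
            records[name] += line.replace(" ", "")
    return records

# Stand-in for open("sequences.fasta"); that filename is hypothetical
demo = io.StringIO(">SEQ1\nMGNLF ENCTH\nRYSFE\n>SEQ2\nMLDDR ARMEA\n")
records = read_fasta(demo)
print(records)
```

With a real file, pass the object returned by open() to read_fasta() and loop over records.values() to scan each sequence.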


Writing Files

name = open("filename", "w")

name = open("filename", "a")

– opens file for write (deletes previous contents), or

– opens file for append (new data goes after previous data)

name.write(str) - writes the given string to the file

name.close() - saves file once writing is done

>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()
>>> open("output.txt").read()
'Hello, world!\nHow are you?'
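Forgetting close() can leave data unwritten. A common alternative, not shown on the slide, is the with statement, which closes the file automatically at the end of the block. A small sketch, written to a temporary directory so it does not touch existing files:

```python
import os
import tempfile

# Temporary location chosen for this example only
path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w") as out:          # "w": start from an empty file
    out.write("Hello, world!\n")
    out.write("How are you?")

with open(path, "a") as out:          # "a": keep the contents, add at the end
    out.write("\nFine, thanks.")

with open(path) as infile:            # default mode is reading
    contents = infile.read()

print(repr(contents))
```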

https://prosite.expasy.org

• Where to put the files ?

Swiss-Knife.py

• Using a database as input! Parse the entire Swiss-Prot collection

– How many entries are there?

– Average protein length (in aa and MW)

– Relative frequency of amino acids

• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
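A minimal sketch of the counting part, assuming the sequences have already been read into an iterable of strings. The name summarize is made up for this example, and molecular weight is left out.

```python
from collections import Counter

def summarize(sequences):
    """Return (number of entries, average length, relative residue frequencies)."""
    counts = Counter()
    n_entries = 0
    total_length = 0
    for seq in sequences:
        n_entries += 1
        total_length += len(seq)
        counts.update(seq)                # count every residue letter
    if total_length == 0:
        return n_entries, 0.0, {}
    average = total_length / n_entries
    freqs = {aa: n / total_length for aa, n in counts.items()}
    return n_entries, average, freqs

# Toy input, not real Swiss-Prot data
n, avg, freqs = summarize(["MKLA", "MAAG", "LL"])
print(n, round(avg, 2))
for aa in sorted(freqs, key=freqs.get, reverse=True):
    print(aa, round(freqs[aa], 3))
```

Because the function only loops over its input once, it also accepts a generator that yields one sequence at a time, which helps when the whole collection does not fit comfortably in memory.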

Amino acid frequencies

Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

Second step: Frequencies of Occurrence

Getting the database

FASTA: Uniprot_sprot.fasta – 268 MB

TEXT: Uniprot_sprot.dat – zipped (560 MB), unzipped (3 GB)

http://www.ebi.ac.uk/uniprot/download-center
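In the TEXT (.dat) format each entry carries an SQ line that already states the length and molecular weight, so both averages can be taken without touching the sequence itself. A rough sketch, assuming SQ lines of the form "SQ   SEQUENCE   256 AA;  29735 MW;  ... CRC64;". The two demo records are invented, and sq_statistics is a made-up name.

```python
import io
import re

# Captures the length (AA) and molecular weight (MW) from an SQ header line
SQ_LINE = re.compile(r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;")

def sq_statistics(handle):
    """Return (number of entries, average length in aa, average MW)."""
    n_entries = 0
    total_aa = 0
    total_mw = 0
    for line in handle:
        match = SQ_LINE.match(line)
        if match:
            n_entries += 1
            total_aa += int(match.group(1))
            total_mw += int(match.group(2))
    if n_entries == 0:
        return 0, 0.0, 0.0
    return n_entries, total_aa / n_entries, total_mw / n_entries

# Invented records; for the real file, assuming it is gzip-compressed,
# gzip.open(filename, "rt") returns a handle that can be passed in instead
demo = io.StringIO(
    "ID   DEMO1_HUMAN   Reviewed;   4 AA.\n"
    "SQ   SEQUENCE   4 AA;  400 MW;  0000000000000000 CRC64;\n"
    "     MKLA\n"
    "//\n"
    "ID   DEMO2_HUMAN   Reviewed;   6 AA.\n"
    "SQ   SEQUENCE   6 AA;  600 MW;  0000000000000000 CRC64;\n"
    "     MAAGLL\n"
    "//\n"
)
stats = sq_statistics(demo)
print(stats)
```

Reading line by line keeps memory use flat, which matters for a 3 GB file.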