Hon J. , Marušiak M. , Martínek T. , Zendulka J. , Bednář...
Prediction of Protein Solubility
Motivation & objective
• protein solubility is a major bottleneck in the production of many therapeutic and industrially attractive proteins
• attempts at experimental solubilization are often unsuccessful and expensive
• objective: reduce costs of proteomic studies by computational prioritization of protein sequences
Hon J.1,2,3, Marušiak M.3, Martínek T.3, Zendulka J.3, Bednář D.1,2, Damborský J.1,2
1 Loschmidt Laboratories, Centre for Toxic Compounds in the Environment RECETOX and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
2 International Clinical Research Center, St. Anne's University Hospital Brno, 656 91 Brno, Czech Republic
3 IT4Innovations Centre of Excellence, Faculty of Information Technology, Brno University of Technology, 612 66 Brno, Czech Republic
SoluProt predictor
• 36 sequence-based features: mostly amino acid content, plus predicted disorder, alpha-helix and beta-sheet content, sequence identity to PDB and several aggregated physico-chemical properties
• random forest regression model
• best accuracy of all compared predictors (58.2% on the test set, Table 1)
• useful for protein prioritization
• implementation in Python available upon request (work in progress)
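The amino acid content features mentioned above can be illustrated with a minimal sketch. The poster does not give the exact feature definitions, so the function name `amino_acid_content` and the choice of plain residue fractions are assumptions made for illustration only; the remaining features (predicted disorder, secondary structure, identity to PDB) need external predictors and are not shown.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_content(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in a sequence.

    Hypothetical helper: the poster only says that most features are
    "amino acid content", not how they are normalised.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


# Example: a short toy peptide (not a real protein).
features = amino_acid_content("MKKLLA")
```

A vector like this, one value per residue type, is the kind of input a random forest regressor could be trained on; which library SoluProt uses for the model is not stated on the poster.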
Outlook
• novel features based on predicted tertiary structure
• implementation of an easy-to-use webserver
• experimental validation using a set of 40 putative haloalkane dehalogenases
[Figure 2 plot. x-axis: 1 − specificity (0.0 to 1.0); y-axis: sensitivity (0.0 to 1.0); curves: BioTech, ccSOL omics, DeepSol, ESPRESSO, mWH, PROSOII, Protein-Sol, SOLpro, SoluProt]
Figure 2. ROC curves of SoluProt and of other sequence-based tools measured on the test set.
[Figure 1 plot. x-axis: worst sequences removed (0% to 100%); y-axis: success increase (0% to 100%); curves: BioTech, ccSOL omics, DeepSol, ESPRESSO, mWH, PROSOII, Protein-Sol, SOLpro, SoluProt]
Figure 1. Increase in the number of soluble sequences when solubility prediction tools are used for prioritization on the SoluProt test set. Removing the 90% worst-scoring sequences from the test set (ranked by SoluProt prediction) increases the number of soluble outcomes by 49.9% compared with blind selection.
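The success increase in Figure 1 can be sketched as follows. The poster does not spell out the formula, so this assumes it is the relative gain in the fraction of soluble sequences among those kept, compared with the fraction in the whole set (blind selection). The function name `success_increase` and the toy data are hypothetical.

```python
def success_increase(scores, labels, removed_fraction):
    """Relative gain in the soluble fraction after removing the worst-scored sequences.

    scores: predicted solubility scores (higher means more likely soluble)
    labels: 1 for soluble, 0 for insoluble
    removed_fraction: share of lowest-scored sequences to discard, in [0, 1)
    Assumed definition, not taken from the poster.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    if not 0 <= removed_fraction < 1:
        raise ValueError("removed_fraction must be in [0, 1)")
    baseline = sum(labels) / len(labels)  # success rate of blind selection
    if baseline == 0:
        raise ValueError("no soluble sequences in the set")
    # Rank from best to worst predicted score and keep the top part.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, round(len(ranked) * (1 - removed_fraction)))
    kept_rate = sum(label for _, label in ranked[:n_keep]) / n_keep
    return kept_rate / baseline - 1


# Toy data: 10 sequences, 4 of them soluble.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gain = success_increase(toy_scores, toy_labels, 0.5)  # about 0.5, i.e. a 50% increase
```

In the toy example, 3 of the 5 kept sequences are soluble (60%) against 40% in the full set, which gives a gain of about 50%.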
References & acknowledgements
1. Helen M. Berman, M. J. G., & Protein Structure Initiative Network of Investigators. (2017). Protein Structure Initiative – Targettrack 2000–2017 – All Data Files [Data set]. Zenodo. DOI: 10.5281/zenodo.821654
2. Price, W. N., Handelman, S. K., Everett, J. K., Tong, S. N., Bracic, A., Luff, J. D., Hunt, J. F. (2011). Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microb Inform and Exp, 1(1), 6. DOI: 10.1186/2042-5783-1-6
SoluProt training set
TargetTrack database [1]: more than 300,000 records from the structural genomics projects
Keep sequences expressed in E. coli only: matching algorithm based on keyword scoring and manual checking of protocol descriptions
Determine solubility: algorithm based on trial status and trial stop status
Remove short sequences and sequences with undefined residues or transmembrane regions
Apply PDB correction: discard insoluble sequences found in the current version of the PDB database
Split by solubility
Cluster each solubility class separately to 25% identity
Join into a single dataset
Balance the number of negative and positive samples and equalize their sequence length distributions
Training set: 10,912 protein sequences
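Two of the workflow steps above, removing short or ambiguous sequences and balancing the classes, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the length cutoff `min_length=20`, the helper names `is_valid` and `balance`, and random undersampling as the balancing method are all hypothetical. Transmembrane filtering, clustering to 25% identity and equalizing length distributions need external tools and are not shown.

```python
import random

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence, min_length=20):
    """Keep a sequence only if it is long enough and has no undefined residues.

    min_length is a hypothetical cutoff; the poster does not state one.
    """
    return len(sequence) >= min_length and set(sequence) <= STANDARD_RESIDUES


def balance(soluble, insoluble, seed=0):
    """Undersample the larger class so both classes have the same size (assumed method)."""
    rng = random.Random(seed)
    n = min(len(soluble), len(insoluble))
    return rng.sample(soluble, n), rng.sample(insoluble, n)


# Toy example with made-up sequences.
candidates = ["MKT" * 10, "MKX" * 10, "MKT"]
kept = [seq for seq in candidates if is_valid(seq)]
pos, neg = balance(["A" * 30, "C" * 30, "D" * 30], ["E" * 30])
```

Only the first toy candidate survives the filter: the second contains the undefined residue X and the third is too short.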
SoluProt test set
• based on the NESG [2] dataset, a set of 9,703 proteins expressed in E. coli
• processed with a similar workflow as the training set, only without the first two steps
• 3,788 sequences in the fully independent test set after processing
Table 1. Performance of sequence-based solubility prediction tools on the SoluProt test set: AUC, the best possible accuracy, and the success increase when the 90% worst sequences are removed from the set.
Tool          AUC    Accuracy   Success increase
SoluProt      0.61   58.2%      49.9%
PROSOII       0.60   57.2%      43.0%
ESPRESSO*     0.57   55.5%      23.5%
DeepSol       0.55   54.1%      30.3%
Protein-Sol   0.54   53.6%      20.8%
mWH           0.54   53.9%      13.5%
SOLpro        0.52   52.3%      2.4%
ccSOL omics   0.51   51.7%      -1%
BioTech       0.50   50.4%      -3.4%
*Property-based ESPRESSO prediction.
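The AUC and accuracy columns of Table 1 can be illustrated with a small sketch. It assumes that "best possible accuracy" means the accuracy at the score threshold that maximizes it, which the poster does not state explicitly; the function names `auc` and `best_accuracy` and the toy data are hypothetical. The AUC is computed as the probability that a randomly chosen soluble sequence scores higher than a randomly chosen insoluble one, which equals the area under the ROC curve.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_accuracy(scores, labels):
    """Highest accuracy over all thresholds t, predicting soluble when score >= t."""
    thresholds = sorted(set(scores)) + [float("inf")]
    best = 0.0
    for t in thresholds:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        best = max(best, correct / len(labels))
    return best


# Toy data: two soluble (1) and two insoluble (0) sequences.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)            # 0.75
toy_acc = best_accuracy(toy_scores, toy_labels)  # 0.75
```

The pairwise AUC is quadratic in the number of sequences, which is acceptable for a test set of a few thousand sequences; a rank-based formula would scale better.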
This work was supported by ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047) including access to computing and storage facilities. Computational resources were supplied by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures".
Contact
Conference ELIXIR CZ 2017 Třešť, 15 – 16 Nov, 2017
ELIXIR CZ
Flemingovo nám. 2, 166 10 Praha 6, Czech Republic
www.elixir-czech.cz
ELIXIR CZ Members
CESNET, z.s.p.o.
Masaryk University – CEITEC and CERIT-SC
Charles University
Palacky University Olomouc
Institute of Microbiology of the CAS
Institute of Biotechnology CAS
Institute of Molecular Genetics of the CAS
Biology Centre CAS
University Hospital at St. Anna in Brno - ICRC
University of Chemistry and Technology
University of South Bohemia
University of West Bohemia
Czech Technical University in Prague - FIT
Organiser: Institute of Organic Chemistry and Biochemistry of the CAS
Organised with the support of MŠMT – large infrastructure project ELIXIR-CZ (Grant LM2015047)
Learn more and register at https://www.elixir-czech.cz/events
• Plenary lectures by Prof. H. Berman and Prof. B. Mons
• Presentations of partner infrastructures
• Portfolio of ELIXIR CZ services
• Flash talks on research undertaken by the ELIXIR CZ community
- Structural Bioinformatics
- Genomics
- Metabolomics
- Cheminformatics
- Proteomics
- Clinical research