Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins...

25
J Grid Computing (2015) 13:561–585 DOI 10.1007/s10723-015-9353-8 Scaling Ab Initio Predictions of 3D Protein Structures in Microsoft Azure Cloud Dariusz Mrozek · Pawel Gosk · Bo˙ zena Malysiak-Mrozek Received: 7 October 2014 / Accepted: 9 October 2015 / Published online: 12 November 2015 © The Author(s) 2015. This article is published with open access at Springerlink.com Abstract Computational methods for protein structure prediction allow us to determine a three-dimensional structure of a protein based on its pure amino acid sequence. These methods are a very important alter- native to costly and slow experimental methods, like X-ray crystallography or Nuclear Magnetic Reso- nance. However, conventional calculations of protein structure are time-consuming and require ample com- putational resources, especially when carried out with the use of ab initio methods that rely on physical forces and interactions between atoms in a protein. Fortunately, at the present stage of the development of computer science, such huge computational resources are available from public cloud providers on a pay- as-you-go basis. We have designed and developed a scalable and extensible system, called Cloud4PSP, which enables predictions of 3D protein structures in the Microsoft Azure commercial cloud. The sys- tem makes use of the Warecki-Znamirowski method as a sample procedure for protein structure predic- tion, and this prediction method was used to test the scalability of the system. The results of the efficiency This work was supported by Microsoft Research within Microsoft Azure for Research Award. A part of the work was performed using the infrastructure supported by POIG.02.03.01-24-099/13 grant: “GeCONiI - Upper Sile- sian Center for Computational Science and Engineering”. D. Mrozek () · P. Gosk · B. Malysiak-Mrozek Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland e-mail: [email protected] tests performed proved good acceleration of predic- tions when scaling the system vertically and horizon- tally. In the paper, we show the system architecture that allowed us to achieve such good results, the Cloud4PSP processing model, and the results of the scalability tests. At the end of the paper, we try to answer which of the scaling techniques, scaling out or scaling up, is better for solving such computational problems with the use of Cloud computing. Keywords Bioinformatics · Proteins · 3D protein structure · Protein structure prediction · Tertiary structure prediction · Ab initio · Protein structure modeling · Cloud computing · Distributed computing · Scalability · Microsoft Azure 1 Introduction Protein structure prediction is one of the most impor- tant and yet difficult processes for modern computa- tional biology and structural bioinformatics [21]. The practical role of protein structure prediction is becom- ing even more important in the face of the dynami- cally growing number of protein sequences obtained through the translation of DNA sequences coming from large-scale sequencing projects. Needless to say, experimental methods for the determination of protein structures, such as X-ray crystallography or Nuclear Magnetic Resonance (NMR), are lagging behind the number of protein sequences. As a consequence, the

Transcript of Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins...

Page 1: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

J Grid Computing (2015) 13:561–585DOI 10.1007/s10723-015-9353-8

Scaling Ab Initio Predictions of 3D Protein Structuresin Microsoft Azure Cloud

Dariusz Mrozek · Paweł Gosk · BozenaMałysiak-Mrozek

Received: 7 October 2014 / Accepted: 9 October 2015 / Published online: 12 November 2015© The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract Computational methods for protein structureprediction allow us to determine a three-dimensionalstructure of a protein based on its pure amino acidsequence. These methods are a very important alter-native to costly and slow experimental methods, likeX-ray crystallography or Nuclear Magnetic Reso-nance. However, conventional calculations of proteinstructure are time-consuming and require ample com-putational resources, especially when carried out withthe use of ab initio methods that rely on physicalforces and interactions between atoms in a protein.Fortunately, at the present stage of the development ofcomputer science, such huge computational resourcesare available from public cloud providers on a pay-as-you-go basis. We have designed and developed ascalable and extensible system, called Cloud4PSP,which enables predictions of 3D protein structuresin the Microsoft Azure commercial cloud. The sys-tem makes use of the Warecki-Znamirowski methodas a sample procedure for protein structure predic-tion, and this prediction method was used to test thescalability of the system. The results of the efficiency

This work was supported by Microsoft Research withinMicrosoft Azure for Research Award. A part of thework was performed using the infrastructure supported byPOIG.02.03.01-24-099/13 grant: “GeCONiI - Upper Sile-sian Center for Computational Science and Engineering”.

D. Mrozek (�) · P. Gosk · B. Małysiak-MrozekInstitute of Informatics, Silesian University of Technology,Akademicka 16, 44-100 Gliwice, Polande-mail: [email protected]

tests performed proved good acceleration of predic-tions when scaling the system vertically and horizon-tally. In the paper, we show the system architecturethat allowed us to achieve such good results, theCloud4PSP processing model, and the results of thescalability tests. At the end of the paper, we try toanswer which of the scaling techniques, scaling outor scaling up, is better for solving such computationalproblems with the use of Cloud computing.

Keywords Bioinformatics · Proteins · 3D proteinstructure · Protein structure prediction · Tertiarystructure prediction · Ab initio · Protein structuremodeling · Cloud computing · Distributedcomputing · Scalability · Microsoft Azure

1 Introduction

Protein structure prediction is one of the most impor-tant and yet difficult processes for modern computa-tional biology and structural bioinformatics [21]. Thepractical role of protein structure prediction is becom-ing even more important in the face of the dynami-cally growing number of protein sequences obtainedthrough the translation of DNA sequences comingfrom large-scale sequencing projects. Needless to say,experimental methods for the determination of proteinstructures, such as X-ray crystallography or NuclearMagnetic Resonance (NMR), are lagging behind thenumber of protein sequences. As a consequence, the

Page 2: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

562 D. Mrozek et al.

number of protein structures in repositories, suchas the world-wide Protein Data Bank (PDB) [3], isonly a small percentage of the number of all knownsequences. Therefore, computational procedures thatallow for the determination of a protein structurefrom its amino acid sequence are great alternatives forexperimental methods for protein structure determina-tion, like X-ray crystallography and NMR.

Protein structure prediction refers to the compu-tational procedure that delivers a three-dimensionalstructure of a protein based on its amino acid sequence(Fig. 1). There are various approaches to the prob-lem and many algorithms have been developed. Thesemethods generally fall into two groups: (1) physicaland (2) comparative [35, 72]. Physical methods relyon physical forces and interactions between atoms ina protein. Most of them try to reproduce nature’s algo-rithm and implement it as a computational procedurein order to give proteins their unique 3D native confor-mations [43]. Following the rules of thermodynamics,it is assumed that the native conformation of a pro-tein is the one that possesses the minimum potentialenergy. Therefore, physical methods try to find theglobal minimum of the potential energy function [41].The functional form of the energy is described byempirical force fields [10] that define the mathemat-ical function of the potential energy, its componentsdescribing various interactions between atoms, andthe parameters needed for computations of each ofthe components (see Section 2.1 for details). The

Input

Modeling Software

Output

Methods:1. Physical (ab initio)2. Comparative

Homology modelingFold recognitionSecondary structure

Fig. 1 3D protein structure prediction: from amino acidsequence (Input) to 3D structure (Output). Modeling softwareuses different methods for prediction purposes: 1 physical, 2comparative

potential energy of an atomic system is composedof several components depending on the force fieldtype. Methods that rely on such a physical approachbelong to so-called ab initio protein structure predic-tion methods. Representatives of the approach includeI-TASSER [68], Rosetta@home [42], Quark [69], andmany others, e.g. [66] and [73]. In the paper, we willfocus on this approach.

On the other hand, comparative methods rely onalready known structures that are deposited in macro-molecular data repositories, such as the Protein DataBank (PDB). Comparative methods try to predict thestructure of the target protein by finding homologsamong sequences of proteins of already determinedstructures and by ”dressing” the target sequence in oneof the known structures. Comparative methods canbe split into several groups including: (1) homologymodeling methods, e.g. Swiss-Model [2], Modeller[14], RaptorX [32], Robetta [36], HHpred [63], (2)fold recognition methods, like Phyre [34], Raptor [70],and Sparks-X [71], and (3) secondary structure predic-tion methods, e.g. PREDATOR [18], GOR [19], andPredictProtein [56].

Both physical and comparative approaches are stillbeing developed and their effectiveness is assessedevery two years in the CASP experiment (Criti-cal Assessment of protein Structure Prediction). Itmust also be admitted that ab initio prediction meth-ods require significant computational resources, eitherpowerful supercomputers [53, 60] or distributed com-puting. The proof of the latter give projects suchas Folding@home [62] and Rosetta@Home [9] thatmake use of Grid computing architecture.

A promising alternative seems also to be providedby Cloud computing, which allows hardware infras-tructure to be leased in the pay-as-you-go model.Cloud computing is a model for enabling convenient,on-demand network access to a shared pool of config-urable computing resources (e.g., networks, servers,storage, applications, and services) that can be rapidlyprovisioned and released with minimal managementeffort or service provider interaction [47]. In practice,Cloud computing allows applications and services tobe run on a distributed network using a virtualized sys-tem and its resources, and at the same time, allows toabstract away from the implementation details of thesystem itself. Cloud computing emerged as a result

Page 3: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 563

of requirements for the public availability of comput-ing power, new technologies for data processing andthe need for their global standardization, becominga mechanism allowing control of the development ofhardware and software resources by introducing theidea of virtualization. The use of cloud platforms canbe particularly beneficial for companies and institu-tions that need to quickly gain access to a computersystem which has a higher than average computingpower. In this case, the use of Cloud computing ser-vices can be faster in implementation than using theowned resources (servers and computing clusters) orbuying new ones. For this reason, Cloud computingis widely used in business and, according to Forbes,the market value of such services will significantlyincrease in the coming years [46].

1.1 Related Works

Cloud computing evolved from various forms of dis-tributed computing, including Grid computing. Both,Clouds and Grids provide plenty of computationaland storage resources, which is very attractive forscientific computations performed in the domain ofLife sciences. Therefore, both computing modelsbecame popular in scientific applications for whicha large pool of computational resources allows vari-ous computationally intensive problems to be solved.In the domain of Life sciences there are many ded-icated cloud-based and grid-based tools that allowus to solve problems, such as whole genome andmetagenome sequence analysis [1], gene detection[45], identification of peptide sequences from spec-tra in mass spectrometry [44], analysis of data fromDNA microarray experiments [33], mapping next-generation sequence data to the human genome andother reference genomes, for use in a variety of bio-logical analyses including SNP discovery, genotyp-ing and personal genomics [13, 57], gene sequenceanalysis and protein characterization [28], 3D ligandbinding site comparison and similarity searching ofa structural proteome [24], molecular docking [4, 8],protein structure similarity searching and structuralalignment [25, 50–52], and many others.

In the domain of protein structure prediction, itis worth noting two cloud-based solutions. The firstone is an open-source Cloud BioLinux [38], which

is a publicly accessible Virtual Machine (VM) thatenables scientists to quickly provision on-demandinfrastructures for high-performance bioinformaticscomputing using cloud platforms. Cloud BioLinuxprovides a range of pre-configured command lineand graphical software applications, including a full-featured desktop interface, documentation and over135 bioinformatics packages for applications includ-ing sequence alignment, clustering, assembly, display,editing, and phylogeny. The release of the CloudBioLinux reported in [31] consists of the alreadymentioned PredictProtein suite that includes methodsfor the prediction of secondary structure and sol-vent accessibility (profphd), nuclear localization sig-nals (predictnls), and intrinsically disordered regions(norsnet).

The second solution is Rosetta@Cloud [27].Rosetta@Cloud is a commercial, cloud-based pay-as-you-go server for predicting protein 3D struc-tures using ab initio methods. Services provided byRosetta@Cloud are analogous to the well-establishedRosetta computational software suite, with a variety oftools developed for macromolecular modeling, struc-ture prediction and functional design. Rosetta@Cloudwas originally designed to work on the Cloud providedby Amazon Web Service (AWS). It delivers a scalableenvironment, fully functional graphical user interfaceand a dedicated pricing model for using computingunits (AWS Instances) in the prediction process.

Great progress in scaling molecular simulations hasalso been brought by scientific initiatives that makeuse of the Grid computing architecture. Lagana etal. [39] presented the foundations and structures ofthe building blocks of the ab initio Grid EmpoweredMolecular Simulator (GEMS) implemented on theEGEE computing Grid. In GEMS the computationalproblem is split into a sequence of three computa-tional blocks, namely INTERACTION, DYNAMICS,and OBSERVABLES. Each of the blocks has differentpurpose. Ab initio calculations determining the struc-ture of a molecular system are performed in the firstblock. The main efforts to implement GEMS on theEGEE Grid were made to the second block, which isresponsible for the calculation of the dynamics of themolecular system. The authors show that for variousreasons the parallelization of molecular simulations ispossible through the application of a simple parame-

Page 4: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

564 D. Mrozek et al.

ter sweeping distribution scheme, which is also usedin our system.

WeNMR [67] is an example of a successful ini-tiative for moving the analysis of Nuclear MagneticResonance (NMR) and Small Angle X-Ray scattering(SAXS) imaging data for atomic and near-atomic res-olution molecular structures to the Grid. The WeNMRmakes use of an operational Grid that was initiallybased on the gLite middleware, developed in thecontext of the EGEE project series [16]. Although,WeNMR focuses on NMR and SAXS, its web por-tal provides access to various services and tools,including those used for the prediction of biologicalcomplexes (HADDOCK [11]), calculating structuresfrom NMR data (Xplor-NIH [58], CYANA [22] andCS-ROSETTA [61]), molecular structure refinementand molecular dynamics simulations (AMBER [6] andGROMACS [65]). A typical workflow is based onGrid job pooling. A specialized process is listeningfor Grid job packages that should be executed in theGrid, another process is periodically checking runningjobs for their status, retrieving the results when ready,and finally, another process performs post-processingof results before they are presented to the user. In thepaper, we will present the system that uses a sim-ilar workflow scheme, with prediction jobs that aredivided into many prediction tasks. These tasks areenqueued in the system for further execution by pro-cessing units that work in a dedicated role-based andqueue-based architecture that we designed.

Gesing et al. [20] report on a very important secu-rity issue while carrying out computations in struc-tural bioinformatics, molecular modeling, and quan-tum chemistry with the use of Molecular SimulationGrid (MoSGrid). MoSGrid is a science gateway thatmakes use of WS-PGRADE [15] as a user interface,gUSE [30] as a high-level middleware service layerfor workload storage, workload execution, monitor-ing and others, UNICORE [64] as a Grid resourcesmiddleware layer, and XtreemeFS [26] as a file sys-tem for storage purposes. Various security elementsare employed in particular layers of the MoSGridsecurity infrastructure with the aim of protecting intel-lectual property of companies performing scientificcalculations.

These projects prove that computations per-formed in bioinformatics require ample computational

resources that are available through the use of Gridor Cloud computing. In the paper, we present a newlydeveloped Cloud4PSP (Cloud for Protein StructurePrediction) system. Cloud4PSP is a cloud-based com-putational framework for 3D protein structure predic-tion. Cloud4PSP was designed and developed for theMicrosoft Azure commercial Cloud. In the paper, weshow the architecture of the Cloud4PSP, its uniquefeatures, and the advantages of using the queue-basedarchitecture instead of direct communication. We alsopresent the processing model used by the system forthose who would like to extend the system with newprediction methods. Finally, we show the scalabil-ity of the system based on the designed architectureon the example of sample prediction processes per-formed with the use of the ab initio prediction methoddesigned by S. Warecki and L. Znamirowski [66].

The paper presents this system, which comple-ments the collection of available Cloud-based andGrid-based tools by:

– sharing a dedicated architecture of a Cloud-based,scalable and extensible system focused on proteinstructure prediction, oriented to the specificity ofthe process,

– showing the benefits of building the system withthe use of roles and queues,

– providing a processing model that involves thecreation of prediction jobs with accompanyingcollections of prediction tasks,

– delivering a ready-to-deploy Azure solution,whose good scalability in the Azure cloud wasproved by a series of tests.

1.2 Microsoft Azure

Microsoft Azure is a commercial cloud platformthat delivers services for building scalable web-basedapplications. Microsoft Azure allows the develop-ment, deployment and management of applicationsand services through a network of data centers locatedin various countries throughout the world. MicrosoftAzure is a public Cloud, which means that the infras-tructure of the Cloud is available for public use and isowned by Microsoft selling cloud services. MicrosoftAzure provides computing resources in a virtualizedform, including processing power, RAM, space and

Page 5: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 565

appropriate bandwidth for transferring data over thenetwork, within the Infrastructure as a Service (IaaS)service model [47]. Moreover, within the Platformas a Service model [47], Azure also delivers a plat-form and dedicated cloud service programming modelfor developing applications that should work on theCloud.

In Fig. 2 we can see types of services andresources provided to a developer of the applicationdeployed to Microsoft Azure cloud. Below we brieflydescribe the Microsoft Azure cloud resources usedin our project:

– Compute: Compute/Cloud Services representapplications that are designed to run on the Cloudand XML configuration files that define how thecloud service should run. The Microsoft Azureprogramming model provides an abstraction forbuilding cloud applications. Each cloud appli-cation is defined in terms of component rolesthat implement the logic of the application. Con-figuration files define the roles and resourcesfor an application. There are two types of rolesthat can be used to implement the logic of theapplication:

– Web role - is a virtual machine (VM)instance used for providing a web basedfront-end for the cloud service;

– Worker role - is a virtual machine (VM)instance used for generalized develop-ment that performs background process-ing and scalable computations, acceptsand responds to requests, and performslong-running or intermittent tasks.

– Storage: Microsoft Azure also provides a richset of data services for various storage scenarios.These data services enable storage, modificationand reporting on data in Microsoft Azure. Thefollowing components of Data Services can beused to store data: BLOBs that allow the stor-age of unstructured text or binary data (video,audio and images), Tables that can store largeamounts of unstructured non-relational (NoSQL)data, and the Azure SQL Database for storinglarge amounts of relational data, and HDInsight,which is a distribution of Apache Hadoop, basedon the Hortonworks Data Platform.

– Messaging: Messaging mechanisms allow foreffective communication between components

Fig. 2 Applicationdeployed to MicrosoftAzure cloud, which servesas a virtualizedinfrastructure, platform fordevelopers, and gateway forhosting applications

Fabric (virtualization layer)

Compute Data services Networking App Services

Application

user

virtual machines

Page 6: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

566 D. Mrozek et al.

and processes running in the whole cloud service.Queues are a general-purpose technology that canbe used for messaging in a wide variety of sce-narios, including: communication between weband worker roles in multi-tier Azure applications(as in our project), communication between on-premises applications and Azure-hosted applica-tions in hybrid solutions, communication betweencomponents of distributed applications runningon-premises in different organizations or depart-ments of an organization. Microsoft Azure pro-vides two types of queue mechanisms: AzureQueues and Service Bus Queues. Azure Queues,which are part of the Azure Storage infrastruc-ture, enable reliable, persistent messaging withinand between services. Service Bus queues are partof a broader Azure messaging infrastructure thatsupports queuing as well as other communicationpatterns.

– Fabric - the entire compute, storage (e.g. harddrives), and network infrastructure, usually imple-mented as virtualized clusters, constitutes aresource pool that consists of one or more servers(called scale units).

The basic tier of the Microsoft Azure platformallows us to create five classes of virtual machineswith different parameters and computational power(number of cores, CPU/core speed, amount of mem-ory, efficiency of I/O channel). Table 1 shows a list offeatures for available computing units.

Microsoft Azure is a commercial Cloud. Costsassociated with the use of the Azure platform dependon the number of resources used (including compute

units, storage, network traffic and bandwidth), thetime by which these resources are used, the number ofdeployed applications, and the usage of other servicesthat are enabled (e.g., HDInsight, Stream Analytics,Active Directory). The most popular and flexible pay-ment plan is based on pay-as-you-go subscriptions.

2 Methods

Cloud4PSP is an extensible service that allows for abinitio prediction of protein structures from amino acidsequences. It was intentionally designed to work in theAzure cloud. At the moment, Cloud4PSP uses only theWarecki-Znamirowski (WZ) method [66] as a sampleprocedure for protein structure prediction. However,the collection of prediction methods can be extendedby using available programming interfaces. The infor-mation on possible extensions will be provided inSection 2.4.

In terms of efficiency, the mentioned WZ methodis slower than popular optimization methods likegradient-based steepest descent [7], Newton-Raphson[12], Fletcher-Powell [17], or Broyden-Fletcher-Goldfarb-Shanno (BFGS) [59]. However, it finds aglobal minimum of the potential energy with higherprobability and has a lower tendency to converge onand get stuck in the nearest local minimum, whichis not necessarily the global minimum of the energyfunction [66]. The method is based on the modi-fied Monte Carlo approach, which is a better solu-tion. In the section, we give details of the predictionmethod, architecture and processing model used byCloud4PSP, and details of how the system is scaled in

Table 1 Available sizes ofMicrosoft Azure virtualmachines (VMs) for Weband Worker role instances[48]

VM/server Number of CPU core Memory Disk space Cost

type CPU cores speed (GHz) (GB) (GB) ($)/hr

ExtraSmall Shared core 1.0 0.768 19 0.02

Small 1 1.6 1.75 224 0.08

Medium 2 1.6 3.5 489 0.16

Large 4 1.6 7 999 0.32

ExtraLarge 8 1.6 14 2039 0.64

Page 7: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 567

order to handle a growing amount of work related tothe prediction process.

2.1 Prediction Method

The Warecki-Znamirowski (WZ) method for proteinstructure prediction models a protein backbone by (1)sampling numerous protein conformations determinedby a series of φ and ψ torsion angles of the proteinbackbone (Fig. 3), and (2) finding the conformationwith the lowest potential energy, which minimizes theexpression:

minφ0,ψ0,φ1,ψ1,...,φn−1,ψn−1

E(φ0, ψ0, φ1, ψ1, ..., φn−1, ψn−1), (1)

where: E is a function describing the potentialenergy of the current conformation, n is the number

C

N

H

R0

C’

O

0

00

0

C

N

H

R1

C’

O

1

1

1

1

H

C2

N

H

R2

C’

2

22

H

0

1

Fig. 3 Overview of protein construction. Visible atoms form-ing the main chain of the polypeptide. Side chains are markedas R0, R1, R2

of amino acids in the predicted protein structure, andφi, ψi are torsion angles of the ith amino acid.

The basic assumption of the method is that the pep-tide bond is planar (see Fig. 3), which restricts therotation around the C′ − N bond, and the ω angleis essentially fixed to 180 degrees due to the partialdouble-bond character of the peptide bond. Therefore,the main changes of the protein conformation are pos-sible by chain rotations around the φ and ψ angles,and these angles provide the flexibility required forfolding the protein backbone.

The following potential energy function E is usedin the calculations of energy for the current conforma-tion (configuration AN of N atoms, determined by aseries of φ and ψ torsion angles in (1)):

E(AN) =bonds∑

j=1

kbj

2(dj − d0

j )2 +angles∑

j=1

kaj

2(θj − θ0

j )2

+torsions∑

j=1

V

2(1 + cos(pω − γ ))

+N∑

k=1

N∑

j=k+1

(4εkj

[(σkj

rkj

)12

−(

σkj

rkj

)6])

+N∑

k=1

N∑

j=k+1

qkqj

4πε0rkj, (2)

where: the first term represents the bond stretchingcomponent energy and: kb

j is a bond stretching forceconstant, dj is the distance between two atoms (realbond length), d0

j is the optimal bond length; the secondterm represents the angle bending component energyand: ka

j is a bending force constant, θj is the actual

value of the valence angle, θ0j is the optimal valence

angle; the third term represents torsional angle compo-nent energy and: V denotes the height of the torsionalbarrier, p is the periodicity, ω is the torsion angle, γ

is a phase factor; the fourth term represents van derWaals component energy described by the Lennard-Jennes potential and: rkj denotes the distance betweenatoms k and j , σkj is the collision diameter, εkj isthe well depth, and N is the number of atoms in thestructure AN ; the fifth term represents electrostaticcomponent energy and: qk , qj are atomic charges, rkjdenotes the distance between atoms k and j , ε0 is adielectric constant, and N is the number of atoms inthe structure AN .

Page 8: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

568 D. Mrozek et al.

The Warecki-Znamirowski (WZ) algorithm mini-mizes the expression (2) for current values of variablesdefining degrees of freedom, referential values of thevariables taken from force field parameter sets [37],and imposing constraints on torsion angles φ and ψ

resulting from Ramachandran plots [55]. These con-straints are derived based on the experimental observa-tions of possible values of torsion angles for particulartypes of amino acids published in [23].

The full algorithm of Warecki-Znamirowski con-sists of two phases:

– Phase I: Monte Carlo phaseIn the first phase, random values of φ and ψ tor-sion angles are generated for each amino acid. Thespace of possible values for torsion angles (360 ×360◦) is restricted based on the type of amino acidand the Ramachandran plot for the particular type[23]. Protein conformation is then defined by avector of φ and ψ torsion angles. When all anglesare generated for the given amino acid chain, theenergy of the conformation E is calculated. Thewhole process is repeated for a given amount oftries iT otal (the iT otal is specified by the user)and bc best solutions are collected in a dedicatedarray. The result of the Monte Carlo phase is thearray holding the best conformations of the givenamino acid chain. These conformations representapproximations of valid solutions that might stillneed some adjustment.

– Phase II: Angles Adjustment phaseThe adjustment is performed in the second phase.In this phase, each conformation stored in thearray of bc best conformations is examined onceagain. The torsion angles φi and ψi (i =0, 1, ..., n− 1) of each conformation are modifiedone by one by given values ±kφ and ±kψ .This simulates shaking the protein molecule andcontinuous motions of particular atoms, whereinthe amplitude of the motions is determined bythe ±φ and ±ψ values. The k parameter is apositive and decreasing parameter during the con-secutive steps of the minimization of the potentialenergy function. The shaking process is contin-ued until the k parameter reaches a fixed value.

After each modification of the torsion angle theconformational energy E′(AN) is calculated onceagain, and if it is lower than the current minimumE′(AN) < Emin, the modified conformation andits energy are accepted and saved as a current bestresult. The whole adjustment process is performedfor each of the bc best conformations.

The number of random tries iT otal that Phase Iis repeated for is defined by the user. In Ref. [66]Warecki and Znamirowski give the formula by whichusers can estimate the minimal number of iterationsthat are needed to model the molecule containing n

amino acids:

iT otal ≈ log(1 − α)

log(1 − ξn), (3)

where: α is the probability of finding the optimal solu-tion with the accuracy ξ ∈ (0; 1), which is a partof the range for each ith pair of torsional angles,i ∈ 〈0, ..., n − 1〉. For example, when modeling pro-tein containing 5 amino acids (n = 5) and assumingthe accuracy of localization of the optimal solution tobe half of the range for each pair of torsional anglesξ = 0.5 with probability α = 0.9, the number of iter-ations iT otal = 73. It is worth noting that with thesame assumptions iT otal = 2, 357 for n = 10 aminoacids, iT otal = 75, 450 for n = 15 amino acids, andiT otal = 2, 414, 435 for n = 20 amino acids. Thisshows that the number of possible conformations thatshould be explored grows quickly with the number ofamino acids.

The best conformation is the one that is character-ized by the lowest value of the potential energy E.Energies are calculated with the Tinker package [54]and AMBER96 [37] molecular mechanics potentialsand the generalized Born continuum solvation model(GBSA) in the presence of side chains.

2.2 Cloud4PSP Architecture

Cloud4PSP parallelizes the WZ method by the dis-tribution of the computations in the Cloud comput-ing. Cloud4PSP has been developed to work on theMicrosoft Azure public cloud. In the Microsoft Azure

Page 9: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 569

cloud, applications can be delivered by deploying pre-configured virtual machines (IaaS) or as fully func-tional SaaS solutions. For the latter case, Microsoftprovides a dedicated application programming inter-face (API) for developers of any software applicationthat is intended to work on the Cloud. Internally,such an application is composed of a set of roles - aWeb role providing a graphical user interface (GUI)in the form of a web site, and Worker roles imple-menting the application logic. The roles reside in andexecute the application logic on virtual machines pro-vided by Microsoft, as a cloud provider. These virtualmachines come from Microsoft’s standard gallery ofvirtual machines and provide various computationalcapabilities depending on their size (see Table 1).

Roles working in the Cloud4PSP system create adistributed set of processing units PU defined asfollows:

PU = {rW } ∪ {rPM} ∪ RPW , (4)

where rW is the Web role responsible for the inter-action with Cloud4PSP users, rPM is the Prediction-Manager role distributing requests received from theWeb role and preparing the workload, and RPW isa set of m instances of the PredictionWorker roleperforming the prediction of protein structures inparallel.

The architecture of the Cloud4PSP consisting ofthe mentioned processing units is shown in Fig. 4.The Web role provides a front-end for users of thesystem, the PredictionManager role mediates the dis-tribution of the prediction process and stores theresults in the Azure SQL Database, and the Prediction-Worker roles predict the protein structure accordingto the logic of the chosen prediction method. Controlinstructions and parameters are transferred through

Web roleUser

SQL Azure

role

PredictionWorker roles

Input queue

Output queue

DB with prediction results

3D protein structures(PDB files)

PredictionInput

queue

Prediction Output queue

Azure BLOBs

Microsoft Azure public cloud

Prediction Manager

Fig. 4 Architecture of the Cloud4PSP - a cloud-based solution for ab initio protein structure prediction

Page 10: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

570 D. Mrozek et al.

appropriate queues. Roles have access to various stor-age resources, including Azure BLOBs (for PDBfiles with predicted 3D structures) and Azure SQLDatabase (for description of the results).

The Web role provides a Web site which allowsusers to interact with the entire system. Throughthe Web role users input the amino acid sequenceof the protein whose structure is to be predicted.They also choose the prediction method (only oneis available at the moment, but the system providesprogramming interfaces allowing us to bind newmethods), specify its parameters, e.g. the number ofiterations (random tries of the modified Monte Carlomethod), and provide a short description of the inputmolecule.

The PredictionManager role distributes the predic-tion process across many instances of the Predic-tionWorker role. In other words, PredictionManagerimplements the entire logic of how the prediction pro-cess is parallelized. This may depend on the predictionmethod used. Algorithm 1 describes how calculationsare scheduled and parallelized by the PredictionMan-ager role with part of the code adapted to the speci-ficity of the WZ prediction algorithm used. When theprediction is performed according to the WZ algo-rithm, the entire prediction process (in the system alsocalled a prediction job) consists of a vast number ofiterations (random selections of torsion angles) that,in the Cloud4PSP, are performed in parallel by manyinstances of the PredictionWorker role. The numberof iterations is specified by the user through the Webrole. In the configuration of the prediction job, theuser specifies the number of tasks (#tasks) and thenumber of tries performed in each task (#iterations).This gives flexibility in choosing the stopping crite-rion and flexibility in configuring prediction jobs (andin consequence, the number of candidate structuresafter the prediction, since in this implementation of theparallel prediction only the best conformations gen-erated in Phase I by each task are adjusted in PhaseII and these tuned conformations are returned). Thetotal number of iterations performed within the predic-tion job (iT otal) is a product of the number of tasks(#tasks) and the number of tries performed in eachtask (#iterations):

iT otal = #tasks × #iterations. (5)

The PredictionManager role creates a pool of #tasks

tasks (lines 6–7). Descriptions of these tasks, which

contain the information on the prediction algorithm,its parameters, random seed used to initialize a pseu-dorandom number generator for the modified MonteCarlo method, and the number of tries #iterations,are stored in the PredictionInput queue (lines 11–12).Prediction tasks are then consumed by instances of thePredictionWorker role.

Instances of the PredictionWorker role act accord-ing to Algorithm 2. They consume messages fromthe Prediction Input queue (lines 2–4), execute thechosen prediction method, generate successive proteinstructures, calculate potential energies for them (lines6–7), store protein structures in PDB files, and thensave these files in Azure BLOBs (line 8). The set ofPredictionWorker role instances RPW is defined asfollows:

RPW = {rPWi |i = 1, ..., m} (6)

where rPWi is a single instance of PredictionWorkerrole, and m is the number of PredictionWorker roleinstances working in the system.

Usually:

#tasks m. (7)

Steering instructions that control the entire systemand the course of action inside the Cloud4PSP aretransferred as messages through the queueing system.There are four queues present in the architecture of theCloud4PSP:

– Input queue - collects structure predictionrequests (Prediction Job descriptions) generatedby users through the Web site.

– Output queue - collects notifications of predic-tion completion for particular users (based on agenerated token number).

– Prediction Input queue - transfers parameters ofthe prediction process for a single instance of thePredictionWorker role, among others: amino acidsequence provided by the user, the number of iter-ations #iterations to be performed, parametersof the prediction method.

– Prediction Output queue - transfers descriptionsof results, e.g. PDB file name with protein struc-ture and potential energy value for the structure,from each instance of the PredictionWorker roleback to the PredictionManager role.

The Input queue and Prediction Input queue arevery important for buffering prediction requests and

Page 11: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 571

steering the prediction process. The prediction pro-cess may take several hours and is performed asyn-chronously. A user’s requests are placed in the Inputqueue with the typical FIFO discipline and theyare processed when there are idle instances of thePredictionWorker role that can realize the request.

Many users can generate many prediction requests,so there must be a buffering mechanism for theserequests. Queues fulfill this task perfectly. One pre-diction request enqueued in the Input queue causesthe creation of many prediction orders (tasks) forinstances of the PredictionWorker role. The Predic-

Page 12: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

572 D. Mrozek et al.

Azure Storage

Web node PredictionManager node

Web rolePredictionManager

role

Prediction Job

Prediction Workernode

PredictionWorkerrole

1. Configure and execute

2. Request for execution

3. Initialize job

Prediction Task

Phase I: MC simulation

Phase II: Anglesadjustment

AzureSQL

Database

4. Create prediction tasks

5. Send a task for execution

6. Execute task

8. Confirm task execution and send description of results

9. Send description of results to DB

Azure BLOBs

7. Save predicted protein structures

11. Confirm the completion of execution

10. Finalize the job12. Retrieve results

Fig. 5 Cloud4PSP processing model showing components and objects involved in the prediction process, and the flow of messagesand data between components of the system

tionManager role consumes prediction requests fromthe Input Queue, generates many predictions tasks,each for performing the specified number of iterationsby PredictionWorker roles available in the system,and generates task description messages to the Predic-tion Input queue. The Prediction Output queue is usedby instances of the PredictionWorker role to return adescription of results to the PredictionManager role.Such a feedback allows the PredictionManager role tocontrol if all the prediction tasks have been completed.In the case of failure to realize of a prediction task,its completion can be delegated to another Prediction-Worker instance. The Output queue allows the Webrole to be notified that the whole prediction process isalready completed.

2.3 Cloud4PSP Processing Model

While building the Cloud4PSP, we have designed andmade available a dedicated processing model (Fig. 5)

that is used by the Cloud4PSP and can be used bydevelopers extending the system in the future. Inthe Cloud4PSP processing model, every predictionrequest generated by a user causes the creation of aPrediction Job (step 1). This is a special object withthe information on the entire prediction that must beperformed. This Prediction Job is submitted for exe-cution (step 2), i.e., it is serialized to the messagethat is placed in the Input queue. The PredictionMan-ager role consumes Prediction Jobs, initialize them(step 3) and, based on their description and avail-able resources, creates so-called Prediction Tasks (step4). These tasks are serialized to the messages thatare sent to the Prediction Input queue (step 5) andare then consumed by successive PredictionWorkerrole instances. PredictionWorker role instances exe-cute tasks according to the description contained in theretrieved message and using an appropriate predictionmethod (step 6), and finally save the predicted pro-tein structures in the Azure BLOB storage service

Page 13: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 573

Fig. 6 Interfaces, theirimplementations forPredictionManager andPredictionWorker in theform of base classes, andinheritance for classesrepresenting a particularprediction method. Dum-myPredictionManager andDummyPrediction representthe sample test manager andprediction method, Monte-CarloPredictionManagerand MonteCarloPredictionreflect implementations ofthescheduling/parallelizationand prediction procedure forthe described WZ method

(step 7). After completing the Prediction Task, Predic-tionWorker confirms completion of the task and sendsa description of the generated protein structures tothe PredictionManager (step 8). The PredictionMan-ager saves the results in the database managed by theAzure SQL Database (step 9). By using an additionalregister stored in the database, the PredictionManagermaintains awareness of the progress of the PredictionJob execution. This emphasizes the importance of theconfirmations obtained in step 8. When the Prediction-Manager states that all tasks have been completed, itfinalizes the job by setting an appropriate property inthe job’s register in the database (step 10) and sendsthe notification to the Web role through the Outputqueue (step 11). The Output queue stores notifica-tions together with the token number generated forthe Prediction Job at the beginning of the predictionprocess, i.e., when submitting the job for execution.The Web role periodically checks the Output queue formessages reporting the completion of the PredictionJob and the statuses of prediction tasks in the database(step 12). Partial results (results of already finishedtasks) are retrieved from the database. In this way,

the user knows whether the prediction process iscompleted or is still in progress.

Such a processing model provides a kind of abstrac-tion layer for developers of the system, hides someimplementation details, e.g., implementation of com-munication and serialization of jobs and tasks, andfinally, provides a framework for future extensions ofthe system.

2.4 Extending Cloud4PSP

Although Cloud4PSP has been designed to workon the Microsoft Azure Cloud, its architecture andworkflow have a general character and can be alsoimplemented on other Clouds. This, however, wouldrequire adoption of the programming model specificto the Cloud provider and the tailoring of elementsof the architecture to available compute resources.The architecture of Cloud4PSP could also be adaptedand extended to hybrid clouds by combining on-premises compute resources and Microsoft Azurecloud resources. This would require some additionaleffort, related to moving the PredictionManager to

Page 14: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

574 D. Mrozek et al.

the on-premises cluster or private cloud and parcel-ing out prediction tasks across available resources.However, this would also allow for the use of pub-lic cloud resources to satisfy potentially increaseddemand especially for compute units. If the sys-tem needs some extra compute power, e.g., due toan excessively long and computationally demandingprediction process, part of the prediction job (someprediction tasks) could be scheduled for execution onthe Azure cloud. Then, after the prediction process iscompleted these commercial compute resources couldbe released.

Cloud4PSP has been developed in such a way that itallows developers to extend its functionality by addingnew modules. This is important, for example, in sit-uations when new prediction algorithms have to beadded to the system. This determines the openness ofthe system. Cloud4PSP has been mainly developedas a software, which places it in the SaaS layer ofthe cloud stack [47]. However, extensions are possi-ble not only by using the processing model presentedin Section 2.3, but also by a set of programming inter-faces provided for almost all components of the sys-tem. In Fig. 6 we show interfaces, their implementa-tions for PredictionManager (for parallelization logic)and PredictionWorker (for the prediction method) inthe form of base classes, and inheritance for classesrepresenting particular prediction methods. By usingthese base classes, advanced users and programmerscan plug in their own prediction methods (by inher-itance from the PredictionBase class) and their owncomputation scheduling algorithms (by inheritancefrom the PredictionManagerBase class)

2.5 Scaling the Cloud4PSP

The scalability of the system means that it is able tohandle a growing amount of work in a capable man-ner or is able to be enlarged to accommodate thatgrowth [5]. Cloud4PSP has this property. Cloud4PSPis a cloud-based, high performance and highly scal-able system for 3D protein structure prediction. Thescalability of the system is provided by the MicrosoftAzure cloud platform. The system can be scaled up(vertical scaling) or scaled out (horizontal scaling).Microsoft Azure allows us to combine both types ofscaling.

Cloud4PSP can be scaled out and scaled up dur-ing the protein structure prediction process. Scaling

out means changing the value of m in (6). Scaling upimplies the change of the size of the PredictionWorkerrole:

size(rPWi) ∈ {XS, S, M, L, XL}, (8)

where: XS denotes ExtraSmall size, S - Small, M -Medium, L - Large, XL - ExtraLarge. The sizes of thevirtual machines for PredictionWorker role instancesare described in Table 1 and are consistent with theparameters of the computation units provided.

In the Cloud4PSP we also made the followingassumption regarding all instances of the Prediction-Worker role:

∀rPWi ,rPWj ∈RPW ,i,j∈1,...,m,i �=j size(rPWi)=size(rPWj ). (9)

The assumption of the same size for all instances ofthe PredictionWorker role is motivated by the ease ofconfiguration and scaling of the system and by theease of analysis of its scalability.

The number of instances m of the PredictionWorkerrole depends on the configuration of the Cloud4PSP. Itcan be measured in thousands. The m is limited onlyby the user’s subscription for Microsoft Azure cloudresources and the number of compute units availablein the Azure cloud.

3 Results

Prediction of 3D protein structures from the begin-ning with the use of the WZ method requires manyMonte Carlo iterations, which takes some time. Thenumber of iterations and therefore the total executiontime depend on the size of the protein modeled, i.e.,the number of residues in the input sequence. Duringour tests we successfully modeled a 20 amino acidlong part of the nonstructural protein NSP1 moleculefrom the Semliki Forest virus with an RMSD of 0.99A.The NMR structure of the molecule (PDB ID code:1FW5) [40] from the Protein Data Bank is shown inFig. 7 (top). The structural alignment and superposi-tion of the modeled molecule and NMR molecule arepresented in Fig. 7 (bottom). Modeling the 20 aminoacid long part of the NSP1 protein was a sample com-putational procedure used for testing the efficiencyof the designed Cloud4PSP architecture in the real(Microsoft Azure) cloud environment. Tests were per-formed with the use of twenty CPU cores. Two ofthese twenty cores, i.e., two compute units of the

Page 15: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 575

Fig. 7 Crystal structure of the part of the NSP1 molecule (PDBID code: 1FW5) [40] from the Semliki Forest virus that wasmodeled during experiments: (top left) protein skeletal structuremade by covalent bonds between atoms (sticks display mode in

Jmol [29]), (top right) secondary structure showing mainly α-helical character of this fragment (ribbon display mode in Jmol),(bottom) structural alignment and superposition of modeledmolecule and NMR molecule

Small size, were allocated to allow the activity of theWeb role and PredictionManager role. The remainingeighteen cores were used for the activity of the Predic-tionWorker role instances and to test the scalability ofthe system.

In particular, we have tested:

– the vertical scalability of the system - the effi-ciency of the prediction process depending on thesize of instances of the PredictionWorker role,

– the horizontal scalability of the system - the effi-ciency of the prediction process depending on thenumber of instances of the PredictionWorker role,

– the efficiency of the prediction process dependingon the number of prediction tasks and the numberof iterations performed by each task.

In all cases, the scalability has been determined onthe basis of execution time measurements. We havecarried out at least two replicas for each measure-ment. Then, the obtained results were averaged andin such a way they are presented in the followingsections. Averaged values of measurements were alsoused to determine n-fold speedups for both scalingtechniques.

3.1 Vertical Scalability

In the first phase of our tests, we wanted to checkwhich of the sizes for the compute units offered byMicrosoft Azure is most efficient. The basic tier forthe Web and Worker roles consists of the following

five sizes: ExtraSmall (XS), Small (S), Medium (M),Large (L), and ExtraLarge (XL). Their capabilitieswere described in Sect. 1.2 and briefly compared inTable 1. It is worth noting in Table 1 that beginningfrom the Small size up to the ExtraLarge size theamount of available memory and the number of coresdoubles gradually. The ExtraSmall size provides one,shared core and is generally used when testing appli-cations that should work on the Cloud, so as not togenerate unnecessary costs for the testing process.

While testing vertical scalability, we changed thesize of the PredictionWorker role from ExtraSmall(one, shared CPU core) to ExtraLarge (eight CPU

1,00 1,00

1,99

3,90

7,30

0

1

2

3

4

5

6

7

8

XS S M L XL

n-f

old

sp

eed

up

Size of PredictionWorker role

Fig. 8 Results of vertical scaling. Acceleration (n-foldspeedup) of the 3D protein structure prediction (prediction job)as a function of the size of instances of the PredictionWorkerrole

Page 16: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

576 D. Mrozek et al.

1:00

1:05

1:10

1:15

1:20

1:25

XS S M L XL

Tim

e (h

h:m

m)

Size of PredictionWorker role

Fig. 9 Results of vertical scaling. Average execution timefor a single prediction task as a function of the size of thePredictionWorker role

cores). Only one PredictionWorker was used in thistest. The Prediction job was configured to explore 80000 random conformations (iT otal) that were per-formed in 16 prediction tasks. Multiple predictiontasks were assigned to those instances of Predic-tionWorker that possessed more than one CPU core,proportionally to the number of cores (according tothe rule: one core - one task). The input sequence con-tained 20 amino acids of the NSP1 molecule. Eachtask was configured to generate 5 000 random proteinconformations (performed 5 000 iterations per task) inPhase I of the WZ method, and to tune only one bestconformation in Phase II of the method (φ = ψ =8 degrees for k = 3, 2, 1 in successive adjustmentiterations).

In Fig. 8 we show the n-fold speedup when scal-ing the system vertically. We observed that increasingthe size of the PredictionWorker role from Smallto Medium accelerated the prediction process almosttwo-fold. Above the Medium size (2 cores), thedynamics of the acceleration slowed down a bit - forLarge size (4 cores) the acceleration reached 3.90 andfor ExtraLarge size (8 cores) it reached 7.30.

When analyzing the results of tests, we deducedthat the slowdown in the acceleration dynamics wasan effect of two factors. The first factor was an over-head caused by the necessity of handling multiplethreads. For example, on ExtraLarge-sized Prediction-Workers we ran eight prediction tasks concurrently (8CPU cores = 8 threads = 8 prediction tasks). And thesecond factor was the cumulative I/O operations per-formed on a local hard drive by many threads of the

00:00:00

01:00:00

02:00:00

03:00:00

04:00:00

05:00:00

06:00:00

07:00:00

08:00:00

09:00:00

10:00:00

11:00:00

1 2 4 8 16 18

Tim

e (h

h:m

m:s

s)

# PredictionWorker role instances

Fig. 10 Results of horizontal scaling. Total execution time fora prediction job as a function of the number of instances of thePredictionWorker role (Small size)

PredictionWorker role. During computations, the Pre-dictionWorker role makes use of various executableprograms and additional files that must be read fromthe local hard drive. It also stores some intermediateresults on the hard drive. Multiple threads of the rolethat run on the compute unit multiply these read/writeoperations and interfere with each other. Therefore,the more threads were running on the compute unitduring our experiments, the more I/O operations wereperformed, which influenced the n-fold speedup. Thisis also visible when analyzing average execution timesfor prediction tasks, as shown in Fig. 9.

In Fig. 9 we can clearly see that execution of a sin-gle prediction task was the fastest when performed onSmall-sized instances and the slowest on ExtraLarge-sized instances of the PredictionWorker role. How-ever, ExtraLarge instances could host the executionof eight prediction tasks at the same time, whileSmall-sized instances could host only one. Therefore,the total execution time for the whole prediction jobperformed on ExtraLarge instances was 7.30 timesshorter than the same job performed on a single Smallinstance.

3.2 Horizontal Scalability

In the second series of our performance tests, wechecked the horizontal scalability of the Cloud4PSP.Tests were performed in the Microsoft Azure cloud.During this series of tests we increased the number ofinstances of the PredictionWorker role from one to 18.Instances of the PredictionWorker role were hosted oncompute units of the Small size (one CPU core). TheSmall-sized virtual machine represents the cheapest

Page 17: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 577

1,002,04

3,98

7,45

13,08

14,52

0

2

4

6

8

10

12

14

16

1 2 4 8 16 18

n-f

old

sp

eed

up

# PredictionWorker role instances

Fig. 11 Results of horizontal scaling. Acceleration (n-foldspeedup) of the 3D protein structure prediction as a functionof the number of instances of the PredictionWorker role (Smallsize)

computational unit, providing one unshared core, andis the smallest sized unit recommended for productionworkloads [49]. Moreover, as proved by the experi-mental results shown in Fig. 9, the size guaranteed thefastest processing of prediction tasks.

While testing the horizontal scalability, we usedthe same input sequence as in the vertical scalabil-ity tests. The sequence contained 20 amino acids ofthe NSP1 enzyme. The prediction was configured toexplore 4 000 random conformations (iT otal) thatwere performed in 40 tasks. Each task was configuredto generate 100 random protein conformations (per-formed 100 iterations per task) in Phase I of the WZmethod, and to tune only one best conformation inPhase II of the method (φ = ψ = 8 degrees fork = 3, 2, 1 in successive angle adjustment iterations).

In Fig. 10 we show total execution time for thewhole prediction as a function of the number ofinstances of the PredictionWorker role. The resultsobtained proved that the total prediction time canbe significantly reduced when increasing the num-ber of compute units. Upon increasing the number ofinstances from one to 18, we reduced the predictiontime from 9 h and 31 min to only 39 min.

Based on the execution time measurements thatwe obtained during the performance tests, we calcu-lated the n-fold speedup for various configurations ofthe Cloud4PSP with respect to a single instance con-figuration. Figure 11 shows how the n-fold speedupchanges as a function of the number of instances of thePredictionWorker role.

We noticed that employing two instances ofthe PredictionWorker role increased the speed of

the prediction process more than two-fold. Addingmore instances of the role proportionally acceler-ated the process. Finally, the acceleration ratio (n-foldspeedup) reached the level of 14.52, when the num-ber of instances of the PredictionWorker role wasincreased from one to 18. A slowdown of the accel-eration dynamics was observed when using more thanfour compute units. This was caused by the unevenprocessing times of individual tasks – the average exe-cution time per task was 840 s, with minimum 340 sand maximum 1722 s. Moreover, we have to remem-ber that the execution times and n-fold speedup weremeasured for the whole system. And therefore, theexecution times covered not only the execution ofparticular prediction tasks, but also processing andstoring the results of these tasks by the Prediction-Manager role. Consequently, by increasing the degreeof parallelism we raised the pressure on the Pre-dictionManager, which serially processed incomingprediction results.

An interesting question is also why the process-ing times of tasks are so different. The answer liesin the prediction method itself, and specifically in itssecond phase. Phase I is fairly predictable in termsof the speed of generating random protein confor-mations and calculating energies for them (Fig. 12).The problem lies in the second phase, which can takemuch longer, depending on the length of the inputsequences, tuning parameters and the course of tuningitself, which depends on the conformation generatedin the first phase. We observed that for the tested inputsequence Phase I took 5 % and Phase II took 95 % ofthe whole execution time.

3.3 Influence of the Task Size

Since in Phase II of the WZ method angle adjust-ment is only performed for the structure with thelowest energy, while performing our experiments weexpected that the total execution time of the predic-tion process might depend on the configuration ofthe prediction tasks. We decided to verify how thetask size (i.e. the number of iterations) and entirejob configuration influence the whole prediction timeand how strong that influence is. For this purpose,we decided to explore 4 000 random conformations(iT otal) in the prediction process, for the same inputsequence as in the vertical and horizontal scalabil-ity tests. The sequence contained 20 amino acids of

Page 18: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

578 D. Mrozek et al.

Fig. 12 Efficiency of asingle compute unit in thefirst phase of the predictionprocess. Dependencybetween the execution timeand the number of randomlygenerated structures (MonteCarlo trials) for varioussizes of compute units

0:00:00

12:00:00

24:00:00

36:00:00

48:00:00

60:00:00

72:00:00

84:00:00

1 000 10 000 100 000 1 000 000

Tim

e (h

h:m

m:s

s)# Monte Carlo trials (generated structures)

XS & S

M

L

XL

the NSP1 molecule. We tested nine different config-urations of the prediction job. Leaving the number ofrandom conformations (iT otal) constant, we appro-priately selected the number of iterations per task andthe number of tasks. Phase II of the WZ method wasset up to tune only one best conformation for φ =ψ = 8 degrees for k = 3, 2, 1 in successive angleadjustment iterations. Tests were performed usingsixteen Small-sized instances of the PredictionWorkerrole.

The total execution times for various configura-tions of the prediction job are shown in Fig. 13.As can be observed, the most efficient is the con-figuration with the lowest number of tasks and thehighest number of iterations (16t × 250i). The pre-diction process for this configuration took 19 min and

25 s. The configuration with the largest number ofsmall tasks (800t × 5i) proved to be extremely inef-ficient. The prediction process for this configurationtook 11 h and 42 min. The reasons for such a behaviorof the system are the same as already discussed inthe previous section. The most time-consuming phaseof the WZ method is Phase II, which usually takes90–98 % of the task execution time. By configur-ing the system to use many small prediction tasks,we multiplied the execution of the longest phase,which resulted in a long execution time for the wholeprediction job.

For a small number of tasks it is advisable tochoose the number of tasks as a multiple of the numberof instances of the PredictionWorker role. For a largernumber of tasks (if # tasks > 2m) the rule does not

Fig. 13 Dependencybetween execution time andconfiguration of theprediction job(#tasks × #iterations) forthe constant number ofrandomly generatedstructures (Monte Carlotrials, iT otal = 4000) forsixteen instances of thePredictionWorker role

00:00:00

02:00:00

04:00:00

06:00:00

08:00:00

10:00:00

12:00:00

14:00:00

16t x250i

32t x125i

40t x100i

50t x80i

80t x50i

100t x40i

200t x20i

400t x10i

800t x5i

Tim

e (h

h:m

m:s

s)

Job configuration (#tasks x #iterations)

Page 19: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 579

apply, since the execution times for particular tasks aredifferent.

It is difficult to explicitly and clearly define theconfiguration of the prediction job. It depends onmany factors, including the purpose of the predictionprocess, the size of the input molecule, and others.However, choosing a low number of tasks and a largenumber of iterations has the following consequences:

– the efficiency of the prediction process is higher,– users obtain fewer candidate structures after the

prediction is finished, and some good candidatescan be missed,

– fewer prediction results are stored in the databaseand fewer 3D structures are stored in BLOBs,which lowers the long-term costs of storage in theCloud,

– the cost of the computation process is lower dueto higher efficiency.

On the other hand, choosing a large number oftasks and a low number of iterations has the followingconsequences:

– the efficiency of the prediction process is lower,– users obtain more candidate protein structures

after the prediction process, which increases thechance to get the global minimum of the energyfunction,

– more prediction results are stored in the databaseand more candidate 3D structures are stored inBLOBs, which increases the long-term costs ofstorage in the Cloud,

– the cost of the computation process is higher dueto lower efficiency.

3.4 Scale Up, Scale Out or Combine?

Microsoft Azure allows us to scale out and scale upapplications and systems deployed in the Cloud. Thechoice is left to the developer and administrator of thesystem. In both cases, the system must be properlyimplemented to utilize the resources of the multi-corecompute unit (when scaling up) and to distribute, andthen process, parts of the main job by many instancesof the Worker role (when scaling out). The workloadassociated with the development of the system towardseach of the scaling types is difficult to express in num-bers. However, there are some technical reasons thatspeak in favor of particular scaling techniques. More-over, we have conducted a series of tests in order

to compare similar configurations of the Cloud4PSParchitecture and answer which of the configurations ismore efficient.

We tested various configurations of the Cloud4PSPby changing the size of the PredictionWorkerrole from ExtraSmall (one, shared CPU core) toExtraLarge (eight CPU cores) and by changing thenumber of instances of the role from one to eighteen(if possible, depending on the size of the computeunit). The prediction was configured to explore 100000 random conformations (iT otal) that were per-formed in 18 tasks. Multiple prediction tasks wereassigned to instances that possessed more than oneCPU core, proportionally to the number of cores(according to the rule: one core - one task). Theinput sequence contained 20 amino acids of the NSP1enzyme. Each task was configured to generate 5 000random protein conformations (performed 5 000 iter-ations per task) in Phase I of the WZ method, andto tune only one best conformation in Phase II of themethod (φ = ψ = 8 degrees for k = 3, 2, 1 insuccessive adjustment iterations).

We tested the following configurations of theCloud4PSP with respect to resource consumption bythe PredictionWorker role:

– one to eighteen ExtraSmall and Small computeunits,

– one to nine Medium compute units,– one to four Large compute units,– one to two ExtraLarge compute units.

In Fig. 14 we show the number of prediction taskscompleted within one hour as a function of the numberof instances of the PredictionWorker role for varioussizes of the instances. The results of the tests indicatea slight advantage of horizontal scaling over verti-cal scaling. Comparing both scaling techniques, wenoted that the number of prediction tasks completedwithin one hour by eight Small (1-core) instances washigher than those completed by one ExtraLarge (8-core) instance. Similar behavior was observed whencomparing four Small instances vs. one Large (4-core)instance, and two Small instances vs. one Medium(2-core) instance, but the differences were not so sig-nificant. We could draw the same conclusion based onthe charts showing n-fold speedups for vertical scaling(presented in Fig. 8) and horizontal scaling (presentedin Fig. 11). For example, the acceleration ratio was

Page 20: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

580 D. Mrozek et al.

slightly better for eight Small instances of the Predic-tionWorker role (7.45) than for a single, ExtraLarge(8-core) instance (7.30).

When combining both scaling techniques, ourresults again speak in favor of horizontal scaling withthe use of Small-sized instances, e.g., eighteen Smallinstances vs. nine Medium instances, sixteen Smallinstances vs. two ExtraLarge instances, eight Smallinstances vs. two Large instances, etc. The differencesare not significant, but they increase with the num-ber of simulation hours. We must also remember that,although in Fig. 14 we present measurements for par-ticular parameters of the prediction job, the entireexperiment reveals the relationships between particu-lar configurations of the Cloud4PSP. The actual exe-cution times depend on the size of the input sequence,the number of Monte Carlo trials, the number of tasks,and the parameters of the WZ method.

It should also be noted that horizontal scaling iseasier to implement. Once the Cloud4PSP system isdesigned to be scaled horizontally, scaling can be doneat any moment after the system is published in theCloud and only if the system requires higher resources(more compute units). Moreover, it can be done notonly manually by the administrator of the system, butalso automatically based on statistics of the systemperformance and resource usage indicators.

4 Discussion

Building Cloud computing services for the predic-tion of 3D protein structures, such as the presentedCloud4PSP, responds to the current demands for hav-ing widely available and scalable platforms solvingthese types of difficult problems. This is very impor-tant not only for scientists and research laboratoriestrying to predict a single 3D protein structure, but alsofor the biotechnology and pharmaceutical industryproducing new drug solutions for humanity.

Cloud4PSP is such a platform, allowing a highlyscalable prediction process in the Cloud, with all itsadvantages and drawbacks. Cloud4PSP was originallydesigned to be deployed in Microsoft’s commercialcloud, where computing resources can be dynamicallyallocated as needed. As we have shown, the predic-tion process can be scaled up by adding more powerfulcomputing instances, or scaled out by allocating morecomputing units of the same types. The results of our

experiments have shown that scaling out by addingmore Small-sized computing units turned out to bemore effective and to give more development flexibil-ity than scaling up (the acceleration rate for 8 coreswas 7.45 when scaling out and 7.30 when scaling up).However, in the case of big molecules being modeled,we may be forced to use larger compute units, e.g.Medium, Large, or ExtraLarge, and to combine hori-zontal and vertical scaling. Delivering Cloud4PSP asa service offloads bio-oriented companies from main-taining their own IT infrastructures, which can now beoutsourced from the cloud provider. This has anotherpositive consequence - a low barrier to entry. Even alaboratory or a company that cannot afford to possessa high-performance computational infrastructure canoutsource it in the Cloud and perform a prediction pro-cess starting with several computation units, then growup, if needed, and scale its systemo at any time.

On the other hand, the important aspects of process-ing data in the Cloud are privacy and data protection.As rightly pointed out by Gesing et al. [20], boththe input and output data of molecular simulationsare sensitive and may constitute a valuable intel-lectual property. Therefore, they need to be storedsecurely. Cloud security is still an evolving domainand defensive mechanisms are implemented on vari-ous levels of the functioning of the cloud-based sys-tem. In order to protect the systems deployed in theCloud, Microsoft Azure provides a reach set of secu-rity elements, including SSL mutual authenticationmechanisms while accessing various components ofthe system, encryption of data stored in Azure Storageand encryption of data transmission, firewalled andpartitioned networks to help protect against unwantedtraffic from the Internet, and many others. Many ofthe security elements are explicitly, and many of themimplicitly, used by Cloud4PSP, e.g., secure access toAzure BLOBs using HTTPS, secure authenticationswhen accessing SQL Azure, where prediction resultsare stored, SSL-based communication between inter-nal components. In terms of data privacy, Microsoft iscommitted to safeguarding the privacy of data and fol-lows the provisions of the E.U. Data Protection Direc-tive. Users of the Azure cloud, i.e., cloud applicationdevelopers, may specify the geographic areas wherethe data are stored and replicated for redundancy.Also, the distribution and deployment model of theCloud4PSP supports privacy and data protection. Onthe deployment level, various users and companies are

Page 21: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 581

Fig. 14 Comparison of theefficiency of horizontal andvertical scaling, and thecombination of both scalingtechniques. The number ofprediction tasks completedwithin one hour as afunction of the number ofinstances of thePredictionWorker role forvarious sizes of theinstances. Measurementsfor ExtraSmall (XS) andSmall (S) sizes are similar 0

2

4

6

8

10

12

14

16

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

#pre

dic

tio

n t

asks

per

ho

ur

# PredictionWorker instances

XS

S

M

L

XL

expected to establish their private Cloud4PSP-basedclusters in the Azure cloud for their own simulations.This ensures separation of data and computations per-formed by two companies, for which the predicted3D structures of proteins may be a highly-protectedintellectual property. Finally, on the application level,Cloud4PSP provides additional security mechanismof 128-bit GUID1-based tokens for protecting accessto results of prediction processes.

Comparing Cloud4PSP to other solutions men-tioned in the Introduction section, we can draw thefollowing conclusions. Firstly, Cloud4PSP, in terms ofwhat it offers to the end user, is a typical Softwareas a Service solution (SaaS), and also an extensiblecomputational framework, where advanced users mayadd their own prediction methods. This is the fun-damental feature that differentiates our cloud-basedsoftware from the Cloud BioLinux. Cloud BioLinuxprovides a virtual machine and, in the context of thepresented service stack, functions in the Infrastructureas a Service (IaaS) model, providing pre-configuredsoftware on a pre-configured Ubuntu operating systemand leaving the users the responsibility for maintain-ing the operating systems and configuring the bestway to scale the available applications. The secondfundamental difference between the two solutionsis the software used for structure prediction. CloudBioLinux is rich in various software packages, allow-ing us to perform many processes in the domainof bioinformatics. However, in terms of the protein

1GUID - Globally Unique Identifier

structure prediction, it represents a different approachto the problem. It relies on the PredictProtein andprovides tools for the prediction of secondary struc-tures of macromolecules, not 3D structures. UnlikeCloud BioLinux, Cloud4PSP is focused on ab initioprediction methods for tertiary structure.

Considering these two mentioned features, i.e. theservice model and prediction methods, Cloud4PSP ismore similar to Rosetta@Cloud and its AbinitioRe-lax application. In both systems prediction services areprovided in the SaaS model. Although both systemsuse different prediction methods, the implementedprediction methods belong to the same ab initio class.Both systems are prepared to be published to theCloud, giving users the ability to establish their pri-vate cloud-based clusters for 3D protein structureprediction, which is also important from the viewpointof data protection. Rosetta@Cloud builds a clusterof compute units on Amazon Web Services (AWS)and Cloud4PSP works on the Microsoft Azure cloud.There are also some similarities and differences inthe architectures of the two systems. Rosetta@Clouduses a so-called Master Node that controls the distri-bution of workload among Worker Nodes. The MasterNode in Rosetta@Cloud is a counterpart of the Pre-dictionManager role in Cloud4PSP, and Worker Nodesare counterparts of instances of the PredictionWorkerrole. However, Rosetta’s GUI for launching predic-tions is installed locally on the user’s computer asRosetta@Cloud Launcher, i.e., it works outside thecloud infrastructure, while the Cloud4PSP GUI isdesigned to be available through a web site (providedby the Web role working inside the cloud), and there-fore, users do not have to install anything locally.

Page 22: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

582 D. Mrozek et al.

Moreover, Cloud4PSP uses queues to buffer predic-tion requests. Applying queues has several advan-tages. Since queues provide an asynchronous messag-ing model, users of Cloud4PSP need not be onlineat the same time. Queues reliably store predictionrequests as messages until the Cloud4PSP is readyto process them. Then, Cloud4PSP can be adjustedand scaled out according to the current needs. As thedepth of the prediction request queue grows, morecomputing resources can be provisioned. Therefore,such an approach allows money to be saved, takinginto account the amount of infrastructure required toservice the application load. Moreover, the queue-based approach allows load balancing - as the numberof requests in a queue increases, more instances ofthe PredictionWorker role can be added to acceleratethe prediction. In both systems the prediction processcan be measured and end-users can be charged basedon some metrics, such as the amount of computinginstances, storage requirements, and data transfers.

On a critical note, we have to admit that, althoughCloud4PSP implements a relatively new algorithmfor 3D protein structure prediction, which is basedon Monte Carlo simulations and structure refinement,at the current stage of development, it is unable tocatch up Rosetta@Cloud in terms of the wealth ofdeployed software for predicting protein structures.However, the advantage of the WZ method used inthe Cloud4PSP is that it eliminates the drawbacks ofwidely used gradient-based methods and this brings itcloser to finding the global minimum of energy function.

Implementation of other prediction methodsremains in the perspective of system development inthe future. Advanced users may also extend the sys-tem by adding their own, new prediction methods. Wealso have to mention that our tests were performedfor the prediction of small molecular structures. Forlarge proteins, the number of iterations needed formodeling the structure increases exponentially, andthe process requires more computational resourcesthan those which we had during our tests.

5 Summary

Predicting protein structures from the beginning usingthe ab initio methods is one of the most challengingtasks for modern computational biology. This requireshuge computing resources to allow us to perform

calculations in parallel in order to reduce the pre-diction time. Cloud computing provides theoreticallyunlimited computing resources in a pay-as-you-gomodel. This is very important feature of the cloudarchitecture. Clouds provide an important comput-ing power that can be provisioned on-demand andaccording to particular needs, which perfectly fits thecharacter of the prediction process.

6 Availability

Cloud4PSP software is available at the projecthome page http://zti.polsl.pl/w3/dmrozek/science/cloud4psp.htm. The system can be used by the aca-demic society for free, i.e., the software is free, butusers have to lease the Cloud hardware infrastruc-ture to use the system for their own costs. Userswill have to configure and publish the system on theAzure cloud. Configuration details are provided at theCloud4PSP project home page.

Further development of the system will be car-ried out by the Cloud4Proteins non-profit, scien-tific group (http://www.zti.aei.polsl.pl/w3/dmrozek/science/cloud4proteins.htm).

Acknowledgments We would like to thank MicrosoftResearch for providing us with free access to the computationalresources of the Microsoft Azure cloud under the MicrosoftAzure for Research Award program.

Open Access This article is distributed under the terms of theCreative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unre-stricted use, distribution, and reproduction in any medium,provided you give appropriate credit to the original author(s)and the source, provide a link to the Creative Commons license,and indicate if changes were made.

References

1. Angiuoli, S., Matalka, M., Gussman, A., Galens, K.,et al.: CloVR: a virtual machine for automated andportable sequence analysis from the desktop using Cloudcomputing. BMC Bioinf. 12, 356 (2011)

2. Arnold, K., Bordoli, L., Kopp, J., Schwede, T.: The SWISS-MODEL workspace: a web-based environment for proteinstructure homology modelling. Bioinformatics 22(2), 195–201 (2006)

3. Berman, H., et al.: The Protein Data Bank. Nucleic AcidsRes. 28, 235–242 (2000)

4. Bertis, V., Bolze, R., Desprez, F., Reed, K.: From dedi-cated Grid to Volunteer Grid: large scale execution of a

Page 23: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 583

bioinformatics application. J. Grid Comput. 7(4), 463–478(2009)

5. Bondi, A.: Characteristics of scalability and their impact onperformance. In: 2nd International Workshop on Softwareand Performance, WOSP 2000, pp. 195–203 (2000)

6. Case, D., Cheatham, T., Darden, T., Gohlke, H., Luo,R., Merz, K.J., Onufriev, A., Simmerling, C., Wang, B.,Woods, R.: The Amber biomolecular simulation programs.J. Comput. Chem. 26, 1668–1688 (2005)

7. Chen, C., Huang, Y., Ji, X., Xiao, Y.: Efficiently finding theminimum free energy path from steepest descent path. J.Chem. Phys. 138(16), 164122 (2013)

8. Chen, H.Y., Hsiung, M., Lee, H.C., Yen, E., Lin, S., Wu,Y.T.: GVSS: a high throughput drug discovery service ofAvian Flu and Dengue Fever for EGEE and EUAsiaGrid. J.Grid Comput. 8(4), 529–541 (2010)

9. Chivian, D., Kim, D.E., Malmstrom, L., Bradley, P.,Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl,C.A., Baker, D.: Automated prediction of CASP-5 struc-tures using the Robetta server. Proteins: Struct., Funct.,Bioinf. 53(S6), 524–533 (2003)

10. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz,K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell,J.W., Kollman, P.A.: A second generation force field forthe simulation of proteins, nucleic acids, and organicmolecules. J. Am. Chem. Soc. 117(19), 5179–5197 (1995)

11. De Vries, S., van Dijk, A., Krzeminski, M., van Dijk, M.,Thureau, A., Hsu, V., Wassenaar, T., Bonvin, A.: HAD-DOCK versus HADDOCK: new features and performanceof HADDOCK2.0 on the CAPRI targets. Proteins 69, 726–733 (2007)

12. Edic, P., Isaacson, D., Saulnier, G., Jain, H., Newell, J.:An iterative Newton-Raphson method to solve the inverseadmittivity problem. IEEE Trans. Biomed. Eng. 45(7),899–908 (1998)

13. Emeakaroha, V.C., Maurer, M., Stern, P., Łabaj, P.P.,Brandic, I., Kreil, D.P.: Managing and optimizing bioin-formatics workflows for data analysis in clouds. J. GridComput. 11(3), 407–428 (2013)

14. Eswar, N., Webb, B., Marti-Renom, M.A., Madhusudhan,M., Eramian, D., Shen, M., Pieper, U., Sali, A.: Com-parative Protein Structure Modeling Using MODELLER.Wiley, New York (2007)

15. Farkas, Z., Kacsuk, P.: P-GRADE portal: a generic work-flow system to support user communities. Future Gener.Comput. Syst. 27(5), 454–465 (2011)

16. Ferrari, T., Gaido, L.: Resources and services of the EGEEproduction infrastructure. J. Grid Comput. 9, 119–133(2011)

17. Fletcher, R., Powell, M.: A rapidly convergent descentmethod for minimization. Comput. J. 6(2), 163–168 (1963)

18. Frishman, D., Argos, P.: Seventy-five percent accuracy inprotein secondary structure prediction. Proteins 27, 329–335 (1997)

19. Garnier, J., Gibrat, J., Robson, B.: GOR method for predict-ing protein secondary structure from amino acid sequence.Methods Enzymol. 266, 540–53 (1996)

20. Gesing, S., Grunzke, R., Kruger, J., Birkenheuer, G.,Wewior, M., Schafer, P., et al.: A single sign-on infras-tructure for science gateways on a use case for structuralbioinformatics. J. Grid Comput. 10, 769–790 (2012)

21. Gu, J., Bourne, P.: Structural Bioinformatics (Methods ofBiochemical Analysis), 2nd edn. Wiley-Blackwell, Hobo-ken (2009)

22. Herrmann, T., Guntert, P., Wuthrich, K.: Protein NMRstructure determination with automated NOE assignmentusing the new software CANDID and the torsion angledynamics algorithm DYANA. J. Mol. Biol. 319, 209–227(2002)

23. Hovmoller, S., Zhou, T., Ohlson, T.: Conformations ofamino acids in proteins. Acta Cryst. D58, 768–776 (2002)

24. Hung, C.L., Hua, G.J.: Cloud Computing for protein-ligandbinding site comparison. Biomed. Res. Int., 170356 (2013)

25. Hung, C.L., Lin, Y.L.: Implementation of a parallel pro-tein structure alignment service on Cloud. Int. J. Genomics439681, 1–8 (2008)

26. Hupfeld, F., Cortes, T., Kolbeck, B., Stender, J., Focht, E.,Hess, M., et al.: The XtreemFS architecture - a case forobject-based file systems in Grids. Concurrency Computat.:Pract. Exper. 20(17), 2049–2060 (2008)

27. Insilicos: Rosetta@Cloud: macromolecular modeling inthe Cloud. Fact Sheet. https://rosettacloud.files.wordpress.com/2012/08/rc-fact-sheet bp5-en2a.pdf (2012). Accessed9 March 2015

28. Jithesh, P., Donachy, P., Harmer, T., Kelly, N., Perrott,R., Wasnik, S., Johnston, J., McCurley, M., Townsley, M.,McKee, S.: GeneGrid: architecture, implementation andapplication. J. Grid Comput. 4(2), 209–222 (2006)

29. Jmol Homepage: Jmol: an open-source Java viewer forchemical structures in 3D. http://www.jmol.org. Accessed7 Sept 2015

30. Kacsuk, P., Farkas, Z., Kozlovszky, M., Hermann, G., Bal-asko, A., Karoczkai, K., Marton, I.: WS-PGRADE/gUSEgeneric DCI gateway framework for a large variety of usercommunities. J. Grid Comput. 10(4), 601–630 (2012)

31. Kajan, L., Yachdav, G., Vicedo, E., Steinegger, M., Mirdita,M., Angermuller, C., Bohm, A., Domke, S., Ertl, J., Mertes,C., Reisinger, E., Staniewski, C., Rost, B.: Cloud predic-tion of protein structure and function with PredictProteinfor Debian. BioMed Res. Int. 2013(398968), 1–6 (2013)

32. Kallberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu,H., Xu, J.: Template-based protein structure modeling usingthe RaptorX web server. Nat. Protoc. 7, 1511–1522 (2012)

33. Kanaris, I., Mylonakis, V., Chatziioannou, A., Maglogian-nis, I., Soldatos, J.: HECTOR: enabling microarray experi-ments over the hellenic Grid infrastructure. J. Grid Comput.7(3), 395–416 (2009)

34. Kelley, L., Sternberg, M.: Protein structure prediction onthe Web: a case study using the Phyre server. Nat. Protoc.4(3), 363–371 (2009)

35. Kessel, A., Ben-Tal, N.: Introduction to Proteins: Structure,Function, and Motion. Chapman & Hall/CRC Mathemat-ical & Computational Biology, CRC Press, Boca Raton(2010)

36. Kim, D., Chivian, D., Baker, D.: Protein structure predic-tion and analysis using the Robetta server. Nucleic AcidsRes. 32(Suppl 2), W526–31 (2004)

37. Kollman, P.: Advances and continuing challenges in achiev-ing realistic and predictive simulations of the propertiesof organic and biological molecules. Acc. Chem. Res. 29,461–469 (1996)

Page 24: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

584 D. Mrozek et al.

38. Krampis, K., Booth, T., Chapman, B., Tiwari, B., et al.:Cloud BioLinux: pre-configured and on-demand bioin-formatics computing for the genomics community. BMCBioinf. 13, 42 (2012)

39. Lagana, A., Costantini, A., Gervasi, O., Lago, N.F., Man-uali, C., Rampino, S.: COMPCHEM: progress towardsGEMS a grid empowered molecular simulator and beyond.J. Grid Comput. 8(4), 571–586 (2010)

40. Lampio, A., Kilpelainen, I., Pesonen, S., Karhi, K., Auvi-nen, P., Somerharju, P., Kaariainen, L.: Membrane bindingmechanism of an RNA virus-capping enzyme. J. Biol.Chem. 275(48), 37853–9 (2000)

41. Leach, A. Molecular Modelling: Principles and Applica-tions, 2nd edn. Pearson Education EMA, Essex (2001)

42. Leaver-Fay, A., Tyka, M., Lewis, S., Lange, O., Thompson,J., Jacak, R., et al.: ROSETTA3: an object-oriented softwaresuite for the simulation and design of macromolecules.Methods Enzymol. 487, 545–74 (2011)

43. Lesk, A. Introduction to Protein Science: Architecture,Function, and Genomics, 2nd edn. Oxford University Press,NY (2010)

44. Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H.,et al.: Hydra: a scalable proteomic search engine whichutilizes the Hadoop distributed computing framework.BMC Bioinf. 13, 324 (2012)

45. Lordan, F., Tejedor, E., Ejarque, J., Rafanell, R., Alvarez,J., Marozzo, F., Lezzi, D., Sirvent, R., Talia, D., Badia,R.M.: ServiceSs: an interoperable programming frameworkfor the cloud. J. Grid Comput. 12(1), 67–91 (2014)

46. McKendrick, J. Cloud computing market hot,but how hot? estimates are all over the map.http://www.forbes.com/sites/joemckendrick/2012/02/13/cloud-computing-market-hot-but-how-hot-estimates-are-all-over-the-map/ (2012). Accessed 24 Aug 2015

47. Mell, P., Grance, T.: The NIST definition of cloudcomputing. Special Publication 800-145. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf (2011).Accessed 7 May 2015

48. Microsoft Azure Cloud Services Specification: Sizesfor Cloud Services. https://azure.microsoft.com/pl-pl/documentation/articles/cloud-services-sizes-specs/.Accessed 7 May 2015

49. Microsoft Azure Cloud Services Specification: Sizesfor virtual machines. https://azure.microsoft.com/pl-pl/documentation/articles/virtual-machines-size-specs/.Accessed 7 Sept 2015

50. Mrozek, D.: High-Performance Computational Solutions inProtein Bioinformatics. SpringerBriefs in Computer Sci-ence. Springer International Publishing (2014)

51. Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerat-ing 3D protein structure similarity searching on MicrosoftAzure Cloud with local replicas of macromolecular data.Parallel Processing and Applied Mathematics - PPAM2015, Lecture Notes in Computer Science. Springer, BerlinHeidelberg (2015)

52. Mrozek, D., Małysiak-Mrozek, B., Kłapcinski, A.:Cloud4Psi: cloud computing for 3D protein structuresimilarity searching. Bioinformatics 30(19), 2822–2825(2014)

53. Pierce, L., Salomon-Ferrer, R., de Oliveira, C., McCam-mon, J., Walker, R.: Routine access to millisecond time

scale events with accelerated molecular dynamics. J. Chem.Theory Comput. 8(9), 2997–3002 (2012)

54. Ponder, J.: TINKER - software tools for molecular design.Dept. of Biochemistry & Molecular Biophysics, Washing-ton University, School of Medicine, St. Louis (2001)

55. Ramachandran, G., Ramakrishnan, C., Sasisekaran, V.:Stereochemistry of polypeptide chain configurations. J.Mol. Biol. 7, 95–9 (1963)

56. Rost, B., Liu, J.: The PredictProtein server. Nucleic AcidsRes. 31(13), 3300–3304 (2003)

57. Schatz, M.C.: CloudBurst: highly sensitive read map-ping with MapReduce. Bioinformatics 25(11), 1363–1369(2009)

58. Schwieters, C., Kuszewski, J., Tjandra, N., Clore, G.: TheXplor-NIH NMR molecular structure determination pack-age. J. Magn. Reson. 160, 65–73 (2003)

59. Shanno, D.: On Broyden-Fletcher-Goldfarb-Shannomethod. J. Optimiz Theory Appl., 46 (1985)

60. Shaw, D.E., Dror, R.O., Salmon, J.K., Grossman, J.P.,Mackenzie, K.M., Bank, J.A., Young, C., Deneroff, M.M.,Batson, B., Bowers, K.J., Chow, E., Eastwood, M.P.,Ierardi, D.J., Klepeis, J.L., Kuskin, J.S., Larson, R.H.,Lindorff-Larsen, K., Maragakis, P., Moraes, M.A., Piana,S., Shan, Y., Towles, B.: Millisecond-scale moleculardynamics simulations on Anton. In: Proceedings of theConference on High Performance Computing Networking,Storage and Analysis, SC ’09, pp. 39:1–39:11. ACM, NewYork (2009)

61. Shen, Y., Vernon, R., Baker, D., Bax, A.: De novo pro-tein structure generation from incomplete chemical shiftassignments. J. Biomol. NMR 43, 63–78 (2009)

62. Shirts, M., Pande, V.: COMPUTING: screen savers of theworld unite! Science 290(5498), 1903–4 (2000)

63. Soding, J.: Protein homology detection by HMM-HMMcomparison. Bioinformatics 21(7), 951–960 (2005)

64. Streit, A., Bala, P., Beck-Ratzka, A., Benedyczak, K.,Bergmann, S., Breu, R., et al.: Unicore 6 - recent and futureadvancements. JUEL, 4319 (2010)

65. Van Der Spoel, D., Lindahl, E., Hess, B., Groenhof, G.,Mark, A., Berendsen, H.: GROMACS: fast, flexible, andfree. J. Comput. Chem. 26, 1701–1718 (2005)

66. Warecki, S., Znamirowski, L.: Random simulation of thenanostructures conformations. In: Proceedings of Inter-national Conference on Computing, Communication andControl Technology, The International Institute of Infor-matics and Systemics, Austin, Texas, vol. 1, pp. 388–393(2004)

67. Wassenaar, T.A., van Dijk, M., Loureiro-Ferreira, N., vander Schot, G., de Vries, S.J., Schmitz, C., van der Zwan, J.,Boelens, R., Giachetti, A., Ferella, L., Rosato, A., Bertini,I., Herrmann, T., Jonker, H.R., Bagaria, A., Jaravine, V.,Guntert, P., Schwalbe, H., Vranken, W.F., Doreleijers, J.F.,Vriend, G., Vuister, G., Franke, D., Kikhney, A., Svergun,D.I., Fogh, R.H., Ionides, J., Laue, E.D., Spronk, C., Jurksa,S., Verlato, M., Badoer, S., Dal Pra, S., Mazzucato, M.,Frizziero, E., Bonvin, A.M.: WeNMR: structural biology onthe Grid. J. Grid Comput. 10(4), 743–767 (2012)

68. Wu, S., Skolnick, J., Zhang, Y.: Ab initio modeling ofsmall proteins by iterative TASSER simulations. BMCBiol. 5(17) (2007)

Page 25: Scaling Ab Initio Predictions of 3D Protein Structures in ... · Keywords Bioinformatics ·Proteins ·3D protein structure ·Protein structure prediction ·Tertiary structure prediction

Scaling Predictions of Protein Structures in Microsoft Azure Cloud 585

69. Xu, D., Zhang, Y.: Ab initio protein structure assem-bly using continuous structure fragments and optimizedknowledge-based force field. Proteins 80(7), 1715–35(2012)

70. Xu, J., Li, M., Kim, D., Xu, Y.: RAPTOR: optimalprotein threading by linear programming, the inaugu-ral issue. J. Bioinform Comput. Biol. 1(1), 95–117(2003)

71. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improvingprotein fold recognition and template-based modeling byemploying probabilistic-based matching between predicted

one-dimensional structural properties of query and cor-responding native properties of templates. Bioinformatics27(15), 2076–2082 (2011)

72. Zhang, Y.: Progress and challenges in protein structureprediction. Curr. Opin. Struct. Biol. 18(3), 342–348 (2008)

73. Znamirowski, L.: Non-gradient, sequential algorithm forsimulation of nascent polypeptide folding. In: Sunderam,V.S., van Albada, G.D., Sloot, P.M., Dongarra, J.J. (eds.)Computational Science - ICCS 2005, Lecture Notes inComputer Science, vol. 3514, pp. 766–774. Springer,Berlin Heidelberg (2005)